Remove a bunch of places where we create ServerStateNode. We were
creating a SSN even though the server was long dead and processed.
The revived SSN was messing up the little dance we do unassigning
procedures. In particular, in UnassignProcedure, the check for a
dead server inside isLogSplittingDone (it returns true, i.e. we can
proceed because the server is dead) fails if an SSN exists.
We were creating SSNs when we didn't need them, and also inadvertently.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
Print serverstatenode when reporting expiration. Helps debugging.
Make moveFromOnlineToDeadServers return whether the server was online or not.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
Make do w/ serverName in place of serverNode in a few places.
In waitServerReportEvent, create a ServerStateNode if none exists, though we
should not have to at this point; to figure out later: TODO.
addRegionToServer no longer automatically creates an SSN,
so do an explicit create when processing the meta load and the region
is OPEN, so we can associate OPEN regions with the SSN.
Do not schedule an SCP if server is not online, not in fs, and not in
dead servers. No point (and there may be cases where server is long
gone but hbase:meta still refers to it though it has not carried
regions in a long time; running an assign/unassign against such a
server will fail because it is not there but SCP won't clean up
the outstanding hung RPC because our region is not on the long-gone
server).
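The scheduling rule above can be sketched as a simple predicate. This is an illustrative sketch only; `ScpGate` and its parameter names are hypothetical and are not HBase API.

```java
/** Hypothetical helper; names are illustrative, not HBase API. */
final class ScpGate {
  private ScpGate() {}

  /**
   * Returns true only when there is some evidence the server is still
   * relevant: it is online, it still has a WAL directory in the filesystem,
   * or it is tracked in the dead-servers list.
   */
  static boolean shouldScheduleScp(boolean online, boolean hasWalDirInFs,
      boolean inDeadServers) {
    return online || hasWalDirInFs || inDeadServers;
  }
}
```

A long-gone server that hbase:meta still refers to fails all three checks, so no SCP is queued for it.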
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java
Just cleanup. Make it so addRegionToServer and remove can deal if no SSN.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterWalManager.java
Add isWALDirectoryNameWithWALS utility.
For RIT Duration, do better than print ms/seconds. Remove redundant UI
column dedicated to duration when we log it in the status field too.
Make bypass log at INFO level.
Make it so on complete of subprocedure, we note count of outstanding
siblings so we have a clue how much further the parent has to go before
it is done (Helpful when hundreds of servers doing SCP).
Have the SCP run the AP preflight check before creating an AP; saves
creation of thousands of APs during fixup.
Don't log tablename three times when reporting remote call failed.
If lock is held already, note who has it. Also log after we get lock
or if we have to wait rather than log on entrance though we may
later have to wait (or we may have just picked up the lock).
Signed-off-by: Mike Drob <mdrob@apache.org>
Adds override to assigns and unassigns. Renames the bypass 'force'
param to 'override' so the naming aligns.
Adds recursive to 'bypass', a means of calling bypass on
parent and its subprocedures (usually bypass works on
leaf nodes rippling the bypass up to parent -- recursive
has us work in the opposite direction): EXPERIMENTAL.
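A minimal sketch of the top-down direction described above, assuming a procedure tree where each node only knows its children. `ProcNode` and `bypassRecursively` are hypothetical names, not the Pv2 classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a procedure with subprocedures. */
final class ProcNode {
  final String name;
  final List<ProcNode> children = new ArrayList<>();
  boolean bypassed;

  ProcNode(String name) {
    this.name = name;
  }

  ProcNode addChild(ProcNode child) {
    children.add(child);
    return this;
  }

  /**
   * Top-down bypass: mark this node first, then every descendant.
   * Returns how many nodes were newly marked.
   */
  int bypassRecursively() {
    int marked = 0;
    if (!bypassed) {
      bypassed = true;
      marked++;
    }
    for (ProcNode child : children) {
      marked += child.bypassRecursively();
    }
    return marked;
  }
}
```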
bypass on an assign/unassign leaves region in RIT and the
RegionStateNode loaded with the bypassed procedure. First
implementation had assign/unassign cleanup leftover state.
Second implementation, on feedback, keeps the state in place
as a fence against other Procedures assuming the region entity,
and instead adds an 'override' function that hbck2 can set on
assigns/unassigns to override the fencing.
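The fencing idea can be sketched as below, assuming the region simply remembers which procedure holds it. `RegionFence` and `tryAttach` are hypothetical names for illustration and do not mirror RegionStateNode.

```java
/** Hypothetical sketch of a region entity fenced by its current procedure. */
final class RegionFence {
  // Name of the procedure currently holding the region, or null if none.
  private String owner;

  /**
   * Attaches the procedure if the region is free, or if override is set.
   * Returns true when the caller now owns the region.
   */
  synchronized boolean tryAttach(String procedure, boolean override) {
    if (owner != null && !override) {
      return false; // fenced: a (possibly bypassed) procedure is still attached
    }
    owner = procedure;
    return true;
  }

  synchronized String owner() {
    return owner;
  }
}
```

A bypassed procedure stays attached, so an ordinary assign is refused while one submitted with override set takes ownership.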
Note that the below also converts ProcedureExceptions that
come out of the Pv2 system into DoNotRetryIOEs. It is a
little awkward because DNRIOE is in client-module, not
in procedure module. Previously, we'd just keep retrying
the bypass, etc.
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/Procedure.java
Have bypass take an environment like all other methods so subclasses.
Fix javadoc issues.
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureExecutor.java
Javadoc issues. Pass environment when we invoke bypass.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
Rename waitUntilNamespace... etc. to align with how these method types
are named elsewhere, i.e. waitFor rather than waitUntil.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java
Cleanup message we emit when we find an existing procedure working
against this entity.
Add support for a force function which allows Assigns/Unassigns force
ownership of the Region entity.
A hbase-server/src/test/java/org/apache/hadoop/hbase/master/assignment/TestRegionBypass.java
Test bypass and force.
M hbase-shell/src/main/ruby/shell/commands/list_procedures.rb
Minor cleanup of the json output... do iso8601 timestamps.
This reverts commit b96905d1df.
i.e. a revert of a revert so a reapplication!
Revert so I can add signed-off-by....
Signed-off-by: Allan Yang <allan163@apache.org>
Remove unused methods from Sleeper (it's ok, it's @Private).
Remove notion of startTime from Sleeper handling (it is unused).
Allow passing in how long to sleep so can maintain externally.
In HRS, use a RetryCounter to calculate backoff sleep time for when
reportForDuty is failing against a struggling Master.
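A rough sketch of a capped backoff calculation of the kind described above. The doubling policy, the `Backoff` class, and its parameters are assumptions for illustration; they are not the actual RetryCounter configuration used in HRS.

```java
/** Hypothetical capped exponential backoff; not the HBase RetryCounter. */
final class Backoff {
  private Backoff() {}

  /**
   * Sleep time for the given zero-based attempt: baseMillis doubled once
   * per attempt, never exceeding maxMillis.
   */
  static long sleepMillis(int attempt, long baseMillis, long maxMillis) {
    if (attempt < 0 || baseMillis <= 0 || maxMillis < baseMillis) {
      throw new IllegalArgumentException("bad backoff arguments");
    }
    long sleep = baseMillis;
    for (int i = 0; i < attempt && sleep < maxMillis; i++) {
      // Compare against half the cap first so the doubling cannot overflow.
      sleep = (sleep > maxMillis / 2) ? maxMillis : sleep * 2;
    }
    return sleep;
  }
}
```

With a 1 second base and a 60 second cap, attempt 3 sleeps 8 seconds and later attempts settle at the cap.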
Add a check for hbase:meta being online before we go to read it.
If not online, move into a holding-pattern until rectified, probably
by external operator.
Incorporates bulk of patch made by Allan Yang over on HBASE-21035.
M hbase-common/src/main/java/org/apache/hadoop/hbase/util/RetryCounterFactory.java
Add a Constructor for case where retries are forever.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
Move stuff around so that the first hbase:meta read is the AM#loadMeta.
Previously, checking table state and/or favored nodes could end up
trying to read a meta that was not onlined holding up master startup.
Do similar for the namespace table. Adds new methods isMeta and
isNamespace which check that the regions/tables are online. If not,
we wait, logging with a back-off that assigns need to be run.
Signed-off-by: Allan Yang <allan163@apache.org>
Signed-off-by: Duo Zhang <zhangduo@apache.org>
HBASE-21113 Apply the branch-2 version of HBASE-21095, The timeout retry
logic for several procedures are broken after master restarts
I applied the patch HBASE-21095 and then reverted it so could apply the
patch as HBASE-21113 (by reverting the HBASE-21095 revert but pushing
with this message!).
This reverts commit 4978db8102.
Write the hbase-1.x hbck1 lock file to block out hbck1 instances writing
state to an hbase-2.x cluster (could do damage).
Set hbase.write.hbck1.lock.file to false to disable this writing.
Look for the particular case where RS does the close of region w/o
involving Master and log special message in this case. Dodgy. But
until we have Master run shutdown of all regions, better than
the message we currently show.
With this change, if hbase.wal.meta_provider is not explicitly set,
it uses whatever is set with hbase.wal.provider. This avoids the
case of unexpectedly using two different providers when
hbase.wal.provider is set to a non-default value but hbase.wal.meta_provider is not.
This change also includes a documentation (architecture.adoc) update.
Also, this is a port from master to branch-2.
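The provider fallback described in this change can be sketched with a plain map standing in for the configuration. Only the two property names come from this message; `WalProviderConfig`, `metaProvider`, and the default-provider argument are hypothetical.

```java
import java.util.Map;

/** Hypothetical lookup helper; a Map stands in for the HBase Configuration. */
final class WalProviderConfig {
  static final String WAL_PROVIDER = "hbase.wal.provider";
  static final String META_WAL_PROVIDER = "hbase.wal.meta_provider";

  private WalProviderConfig() {}

  /**
   * Resolves the provider for the meta WAL: the explicit meta setting if
   * present, otherwise whatever the regular provider resolves to.
   */
  static String metaProvider(Map<String, String> conf, String defaultProvider) {
    String regular = conf.getOrDefault(WAL_PROVIDER, defaultProvider);
    return conf.getOrDefault(META_WAL_PROVIDER, regular);
  }
}
```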
Signed-off-by: Zach York <zyork@apache.org>
Signed-off-by: Michael Stack <stack@apache.org>
Signed-off-by: Sean Busbey <busbey@apache.org>
Signed-off-by: Duo Zhang <Apache9@apache.org>
This patch switches to the builder pattern by adding a helper method.
It also checks to ensure that the pattern is available (i.e. that
HBase is running on a hadoop version that supports it).
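One way such an availability check could look is a reflective probe for a method by name. Whether the patch uses reflection is an assumption here, and `FeatureProbe` is a hypothetical name; the sketch only uses `Class.getMethod` from the JDK.

```java
/** Hypothetical probe for an optional API on the runtime classpath. */
final class FeatureProbe {
  private FeatureProbe() {}

  /**
   * Returns true if the type exposes a public method with this name and
   * parameter list, false if the running library version lacks it.
   */
  static boolean hasPublicMethod(Class<?> type, String name, Class<?>... params) {
    try {
      type.getMethod(name, params);
      return true;
    } catch (NoSuchMethodException e) {
      return false;
    }
  }
}
```

A caller would probe once, cache the result, and fall back to the older non-builder call when the probe returns false.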
Amending-Author: Mike Drob <mdrob@apache.org>
Signed-off-by: tedyu <yuzhihong@gmail.com>
Signed-off-by: zhangduo <zhangduo@apache.org>
Make the #copyCellInto method smaller so it inlines; we do it by
checking for the common type early and then taking a code path
that presumes ByteBufferExtendedCell -- avoids checks.
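The shape of that optimization, sketched with stand-in types: keep the entry method to one type check plus a call, and push the rarer generic path into a separate method. `SketchCell`, `BufferBackedCell`, and `CellCopier` are hypothetical and are not the HBase Cell classes.

```java
import java.nio.ByteBuffer;

/** Hypothetical cell abstraction, not the HBase Cell interface. */
interface SketchCell {
  byte[] toBytes();
}

/** Hypothetical buffer-backed cell standing in for the common case. */
final class BufferBackedCell implements SketchCell {
  final ByteBuffer buf;

  BufferBackedCell(ByteBuffer buf) {
    this.buf = buf;
  }

  @Override
  public byte[] toBytes() {
    byte[] out = new byte[buf.remaining()];
    buf.duplicate().get(out); // duplicate so the source position is untouched
    return out;
  }
}

final class CellCopier {
  private CellCopier() {}

  /** Kept tiny on purpose: one type check, then a direct call. */
  static void copyInto(SketchCell cell, ByteBuffer dst) {
    if (cell instanceof BufferBackedCell) {
      // Common case: bulk buffer-to-buffer copy, no intermediate array.
      dst.put(((BufferBackedCell) cell).buf.duplicate());
    } else {
      copyGeneric(cell, dst);
    }
  }

  /** Rare case lives in its own method so the hot method stays small. */
  private static void copyGeneric(SketchCell cell, ByteBuffer dst) {
    dst.put(cell.toBytes());
  }
}
```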
- -jar parameter now accepts multiple jar files and directories of jar files.
- observer classes can be verified by -class option.
- -table parameter was added to check table level coprocessors.
- -config parameter was added to obtain the coprocessor classes from
HBase configuration.
- -scan option was removed.
Signed-off-by: Mike Drob <mdrob@apache.org>
With prefetch-on-open enabled, the task doing the prefetching was using
non-positional (i.e. streaming) reads. If the main (non-prefetch) thread
was also using non-positional reads, these two would conflict, because
inputstreams are not thread-safe for non-positional reads.
In the case of an encrypted filesystem, this could cause JVM crashes,
etc, as underlying cipher buffers were freed underneath the racing
threads. In the case of a non-encrypted filesystem, less severe errors
would be thrown. The included unit test reproduces the latter case.
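To illustrate the difference, here is a sketch of a positional read using the JDK's `FileChannel.read(ByteBuffer, long)`, which does not modify the channel's position. This is a stand-in for the Hadoop input stream used by the prefetch task, not the code from the patch; `PositionalRead.pread` is a hypothetical helper.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Arrays;

/** Hypothetical helper showing a read that leaves the shared position alone. */
final class PositionalRead {
  private PositionalRead() {}

  /**
   * Reads up to len bytes starting at offset. The channel's own position is
   * never moved, so another thread doing streaming reads on the same channel
   * does not see its position change underneath it.
   */
  static byte[] pread(FileChannel ch, long offset, int len) throws IOException {
    ByteBuffer buf = ByteBuffer.allocate(len);
    long pos = offset;
    while (buf.hasRemaining()) {
      int n = ch.read(buf, pos); // positional read: explicit offset per call
      if (n <= 0) {
        break; // end of file, or no progress
      }
      pos += n;
    }
    return Arrays.copyOf(buf.array(), buf.position());
  }
}
```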
(cherry picked from commit 025ddce868)
Signed-off-by: Todd Lipcon <todd@cloudera.com>
Also changes modify table operations to help the case where a MTP spans
two masters, avoiding the sanity-checks propagating back to the client
unnecessarily.
Signed-off-by: Josh Elser <elserj@apache.org>
Signed-off-by: Michael Stack <stack@apache.org>
ModifyTableProcedure is using MoveRegionProcedure in a way
that was unintended from the original implementation. As such,
we have to guard against certain usages of it. We know we can
re-open OPEN regions, but regions in OPENING will similarly
soon be OPEN (thus, we want to reopen those regions too).
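The guard described above amounts to a state check along these lines. The enum and `ReopenGuard` are hypothetical stand-ins, assumed for illustration, rather than the real region state classes.

```java
/** Hypothetical subset of region states, for illustration only. */
enum SketchRegionState { OFFLINE, OPENING, OPEN, CLOSING, CLOSED }

final class ReopenGuard {
  private ReopenGuard() {}

  /**
   * A region is a reopen candidate if it is OPEN now, or is OPENING and
   * will therefore be OPEN shortly.
   */
  static boolean shouldReopen(SketchRegionState state) {
    return state == SketchRegionState.OPEN || state == SketchRegionState.OPENING;
  }
}
```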
Signed-off-by: Michael Stack <stack@apache.org>
Signed-off-by: zhangduo <zhangduo@apache.org>
* modify the jar checking script to take args; make hadoop stuff optional
* separate out checking the artifacts that have hadoop vs those that don't.
* * Unfortunately means we need two modules for checking things
* * put in a safety check that the support script for checking jar contents is maintained in both modules
* * have to carve out an exception for o.a.hadoop.metrics2. :(
* fix duplicated class warning
* clean up dependencies in hbase-server and some modules that depend on it.
* allow Hadoop to have its own htrace where it needs it
* add a precommit check to make sure we're not using old htrace imports
Conflicts:
hbase-backup/pom.xml
hbase-checkstyle/src/main/resources/hbase/checkstyle-suppressions.xml
Signed-off-by: Mike Drob <mdrob@apache.org>
Cannot go to latest (8.9) yet due to
https://github.com/checkstyle/checkstyle/issues/5279
* move hbaseanti import checks to checkstyle
* implement a few missing equals checks, and ignore one
* fix lots of javadoc errors
Signed-off-by: Sean Busbey <busbey@apache.org>
A reattempt at fixing HBASE-20173 [AMv2] DisableTableProcedure concurrent to ServerCrashProcedure can deadlock
The scenario is a SCP after processing WALs, goes to assign regions that
were on the crashed server but a concurrent Procedure gets in there
first and tries to unassign a region that was on the crashed server
(could be part of a move procedure or a disable table, etc.). The
unassign happens to run AFTER SCP has released all RPCs that
were going against the crashed server. The unassign fails because the
server is crashed. The unassign used to suspend itself only it would
never be woken up because the server it was going against had already
been processed. Worse, the SCP could not make progress because the
suspended unassign held the lock on a region that the SCP wanted to
assign.
In here, we add to the unassign recognition of the state where it is
running post SCP cleanup of RPCs. If present, unassign moves to finish
instead of suspending itself.
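Reduced to its core, the decision reads roughly as follows. This is a deliberately simplified sketch: the single boolean and the names `UnassignFailure` and `Action` are assumptions, and the real failure handling in UnassignProcedure covers more cases.

```java
/** Hypothetical reduction of the unassign failure decision. */
final class UnassignFailure {
  enum Action { SUSPEND, FINISH }

  private UnassignFailure() {}

  /**
   * If the SCP has already released the RPCs against the crashed server,
   * nothing will ever wake a suspended unassign, so finish instead.
   */
  static Action onRemoteCallFailure(boolean scpAlreadyReleasedRpcs) {
    return scpAlreadyReleasedRpcs ? Action.FINISH : Action.SUSPEND;
  }
}
```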
Includes a nice unit test made by Duo Zhang that reproduces nicely the
hung scenario.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/FailedRemoteDispatchException.java
Moved this class back to hbase-procedure where it belongs.
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NoNodeDispatchException.java
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NoServerDispatchException.java
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NullTargetServerDispatchException.java
Specializations on FRDE so we can be more particular when we say there
was a problem.
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/RemoteProcedureDispatcher.java
Change addOperationToNode so we throw exceptions that give more detail
on the issue rather than a mysterious true/false.
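The move from a boolean result to typed exceptions can be sketched like this. All class names below are hypothetical stand-ins prefixed with Sketch; they do not reproduce the hbase-procedure exception hierarchy or the RemoteProcedureDispatcher signature.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical base failure type. */
class SketchDispatchException extends Exception {
  SketchDispatchException(String msg) {
    super(msg);
  }
}

/** Hypothetical: the target server has no dispatch node. */
final class SketchNoNodeException extends SketchDispatchException {
  SketchNoNodeException(String server, String op) {
    super("no node for " + server + "; cannot queue " + op);
  }
}

/** Hypothetical: the caller passed no target server at all. */
final class SketchNullTargetException extends SketchDispatchException {
  SketchNullTargetException(String op) {
    super("null target server for " + op);
  }
}

final class SketchDispatcher {
  private final Map<String, List<String>> nodes = new HashMap<>();

  void addNode(String server) {
    nodes.putIfAbsent(server, new ArrayList<>());
  }

  /** Queues op, or throws an exception that says why it could not. */
  void addOperationToNode(String server, String op) throws SketchDispatchException {
    if (server == null) {
      throw new SketchNullTargetException(op);
    }
    List<String> queue = nodes.get(server);
    if (queue == null) {
      throw new SketchNoNodeException(server, op);
    }
    queue.add(op);
  }
}
```

Callers can then catch the specific subtype they know how to handle and let the rest propagate.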
M hbase-protocol-shaded/src/main/protobuf/MasterProcedure.proto
Undo SERVER_CRASH_HANDLE_RIT2. Bad idea (from HBASE-20173)
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
Have expireServer return true if it actually queued an expiration. Used
later in this patch.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
Hide methods that shouldn't be public. Add a particular check used out
in unassign procedure failure processing.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/MoveRegionProcedure.java
Check that server we're to move from is actually online (might
catch a few silly move requests early).
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java
Add doc on ServerState. Wasn't being used really. Now we actually stamp
a Server OFFLINE after its WAL has been split. Means it's safe to assign
since all WALs have been processed. Add methods to update SPLITTING
and to set it to OFFLINE after splitting done.
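A small sketch of the server state lifecycle described for this file. `SketchServerNode` and its method names are hypothetical; only the SPLITTING and OFFLINE state names come from this message, and ONLINE as the starting state is an assumption.

```java
/** Hypothetical server node tracking WAL-splitting progress. */
final class SketchServerNode {
  enum State { ONLINE, SPLITTING, OFFLINE }

  private State state = State.ONLINE;

  /** Called when WAL splitting for the crashed server begins. */
  synchronized void splitting() {
    state = State.SPLITTING;
  }

  /** Called once all WALs of the server have been split. */
  synchronized void splittingDone() {
    state = State.OFFLINE;
  }

  /** Regions are safe to assign only after all WALs have been processed. */
  synchronized boolean isSafeToAssign() {
    return state == State.OFFLINE;
  }
}
```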
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java
Change logging to be new-style and less repetitive of info.
Cater to new way in which .addOperationToNode returns info (exceptions
rather than true/false).
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java
Add looking for the case where we failed assign AND we should not
suspend because we will never be woken up because SCP is beyond
doing this for all stuck RPCs.
Some cleanup of the failure processing grouping where we can proceed.
TODOs have been handled in this refactor including the TODO that
wonders if it is possible that there are concurrent fails coming in
(Yes).
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
Doc and removing the old HBASE-20173 'fix'.
Also updating ServerStateNode post WAL splitting so it gets marked
OFFLINE.
A hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestServerCrashProcedureStuck.java
Nice test by Duo Zhang.
Signed-off-by: Umesh Agashe <uagashe@cloudera.com>
Signed-off-by: Duo Zhang <palomino219@gmail.com>
Signed-off-by: Mike Drob <mdrob@apache.org>