For RIT Duration, do better than print ms/seconds. Remove redundant UI
column dedicated to duration when we log it in the status field too.
Make bypass log at INFO level.
Make it so on complete of subprocedure, we note count of outstanding
siblings so we have a clue how much further the parent has to go before
it is done (Helpful when hundreds of servers doing SCP).
Have the SCP run the AP preflight check before creating an AP; saves
creation of thousands of APs during fixup.
Don't log tablename three times when reporting remote call failed.
If lock is held already, note who has it. Also log after we get lock
or if we have to wait rather than log on entrance though we may
later have to wait (or we may have just picked up the lock).
Signed-off-by: Mike Drob <mdrob@apache.org>
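As an illustration of the first item above (reporting RIT duration as
something better than raw ms/seconds), here is a minimal sketch. The class
and method names are hypothetical and not taken from the patch; it only
shows one way to render elapsed milliseconds in a human-readable form.

```java
import java.time.Duration;

final class DurationFormat {
  /** Renders elapsed milliseconds as e.g. "1h 2m 3s" rather than a raw number. */
  static String humanize(long millis) {
    if (millis < 1000) {
      return millis + "ms"; // sub-second values stay in milliseconds
    }
    Duration d = Duration.ofMillis(millis);
    long hours = d.toHours();
    long minutes = d.toMinutes() % 60;
    long seconds = d.getSeconds() % 60;
    StringBuilder sb = new StringBuilder();
    if (hours > 0) {
      sb.append(hours).append("h ");
    }
    if (hours > 0 || minutes > 0) {
      sb.append(minutes).append("m ");
    }
    sb.append(seconds).append('s');
    return sb.toString();
  }
}
```

For example, DurationFormat.humanize(61000) yields "1m 1s".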
Adds override to assigns and unassigns. Renames the bypass 'force'
param to 'override' to align.
Adds recursive to 'bypass', a means of calling bypass on
parent and its subprocedures (usually bypass works on
leaf nodes rippling the bypass up to parent -- recursive
has us work in the opposite direction): EXPERIMENTAL.
bypass on an assign/unassign leaves region in RIT and the
RegionStateNode loaded with the bypassed procedure. First
implementation had assign/unassign cleanup leftover state.
Second implementation, on feedback, keeps the state in place
as a fence against other Procedures assuming the region entity,
and instead adds an 'override' function that hbck2 can set on
assigns/unassigns to override the fencing.
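The fencing described above can be pictured with a small sketch. This is
not the code from the patch: RegionFence and tryAttach are hypothetical
names, and a String stands in for the procedure attached to the
RegionStateNode. It assumes the fence is simply "first procedure wins
unless override is set".

```java
import java.util.concurrent.atomic.AtomicReference;

final class RegionFence {
  // Stand-in for the procedure currently loaded on a region's state node.
  private final AtomicReference<String> owner = new AtomicReference<>();

  /** Returns true if 'proc' now owns the region entity. */
  boolean tryAttach(String proc, boolean override) {
    if (override) {
      // Override ignores whatever (possibly bypassed) procedure is attached.
      owner.set(proc);
      return true;
    }
    // Without override, a leftover owner fences out the newcomer.
    return owner.compareAndSet(null, proc);
  }

  String owner() {
    return owner.get();
  }
}
```

A bypassed assign left in place keeps returning false to later
procedures until one arrives with override set.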
Note that the below also converts ProcedureExceptions that
come out of the Pv2 system into DoNotRetryIOEs. It is a
little awkward because DNRIOE is in client-module, not
in procedure module. Previously, we'd just keep retrying
the bypass, etc.
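A minimal sketch of the conversion follows. The two exception classes
below are local stand-ins so the example compiles on its own (the real
ones live in the procedure and client modules), and callNoRetry is a
hypothetical helper name, not something from the patch.

```java
import java.io.IOException;

final class BypassErrors {
  // Stand-in for the exception thrown by the Pv2 system.
  static class ProcedureException extends Exception {
    ProcedureException(String msg) {
      super(msg);
    }
  }

  // Stand-in for the client-side "do not retry" IOException.
  static class DoNotRetryIOException extends IOException {
    DoNotRetryIOException(String msg, Throwable cause) {
      super(msg, cause);
    }
  }

  interface ProcedureCall<T> {
    T call() throws ProcedureException;
  }

  /** Runs the call; a ProcedureException surfaces as a non-retriable IOException. */
  static <T> T callNoRetry(ProcedureCall<T> call) throws DoNotRetryIOException {
    try {
      return call.call();
    } catch (ProcedureException e) {
      // Keep the original as the cause so the detail is not lost.
      throw new DoNotRetryIOException(e.getMessage(), e);
    }
  }
}
```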
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/Procedure.java
Have bypass take an environment like all other methods so subclasses
can use it.
Fix javadoc issues.
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureExecutor.java
Javadoc issues. Pass environment when we invoke bypass.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
Rename waitUntilNamespace... etc. to align with how these method types
are named elsewhere, i.e. waitFor rather than waitUntil.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java
Clean up the message we emit when we find an existing procedure working
against this entity.
Add support for a force function which allows Assigns/Unassigns to force
ownership of the Region entity.
A hbase-server/src/test/java/org/apache/hadoop/hbase/master/assignment/TestRegionBypass.java
Test bypass and force.
M hbase-shell/src/main/ruby/shell/commands/list_procedures.rb
Minor cleanup of the json output... do iso8601 timestamps.
This reverts commit b96905d1df.
i.e. a revert of a revert so a reapplication!
Revert so I can add signed-off-by....
Signed-off-by: Allan Yang <allan163@apache.org>
This patch switches to the builder pattern by adding a helper method.
It also checks to ensure that the pattern is available (i.e. that
HBase is running on a hadoop version that supports it).
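The availability check can be done with plain reflection. The sketch
below is an assumption about the general shape, not the helper the patch
adds: BuilderSupport and findMethod are hypothetical names, and the
caller would pass whichever class and method name the builder API
actually uses.

```java
import java.lang.reflect.Method;

final class BuilderSupport {
  /** Returns the named public method if the class has it, else null. */
  static Method findMethod(Class<?> type, String name, Class<?>... params) {
    try {
      return type.getMethod(name, params);
    } catch (NoSuchMethodException e) {
      // Older library version on the classpath: the caller falls back.
      return null;
    }
  }

  /** True when the builder-style entry point exists on this classpath. */
  static boolean isAvailable(Class<?> type, String name, Class<?>... params) {
    return findMethod(type, name, params) != null;
  }
}
```

Looking the method up once and caching the result avoids repeating the
reflective lookup on every file creation.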
Amending-Author: Mike Drob <mdrob@apache.org>
Signed-off-by: tedyu <yuzhihong@gmail.com>
Signed-off-by: zhangduo <zhangduo@apache.org>
Provides an extra client descriptor to build a second
tarball with a reduced set of dependencies. Not of great
impact now, but will pave the way for better in the future.
Signed-off-by: Sean Busbey <busbey@apache.org>
Conflicts:
hbase-assembly/pom.xml
Conflicts:
hbase-spark/pom.xml
A reattempt at fixing HBASE-20173 [AMv2] DisableTableProcedure concurrent to ServerCrashProcedure can deadlock
The scenario is a SCP after processing WALs, goes to assign regions that
were on the crashed server but a concurrent Procedure gets in there
first and tries to unassign a region that was on the crashed server
(could be part of a move procedure or a disable table, etc.). The
unassign happens to run AFTER SCP has released all RPCs that
were going against the crashed server. The unassign fails because the
server is crashed. The unassign used to suspend itself only it would
never be woken up because the server it was going against had already
been processed. Worse, the SCP could not make progress because the
unassign was suspended while holding the lock on a region that the SCP
wanted to assign.
In here, we add to the unassign recognition of the state where it is
running post SCP cleanup of RPCs. If present, unassign moves to finish
instead of suspending itself.
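The decision can be sketched as below. This is illustrative only: the
names are hypothetical, the ONLINE value is an assumption, and so is the
mapping of "SCP is past its RPC cleanup" onto the server being stamped
OFFLINE.

```java
final class UnassignFailure {
  // SPLITTING and OFFLINE are named in this patch; ONLINE is assumed here.
  enum ServerState { ONLINE, SPLITTING, OFFLINE }

  enum Next { SUSPEND, FINISH }

  /** Decides what a failed unassign should do, given the target server's state. */
  static Next onRemoteCallFailed(ServerState serverState) {
    // Assumption: OFFLINE means the crash procedure has already released
    // the stuck RPCs, so nothing will ever wake a suspended unassign.
    return serverState == ServerState.OFFLINE ? Next.FINISH : Next.SUSPEND;
  }
}
```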
Includes a nice unit test made by Duo Zhang that reproduces nicely the
hung scenario.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/FailedRemoteDispatchException.java
Moved this class back to hbase-procedure where it belongs.
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NoNodeDispatchException.java
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NoServerDispatchException.java
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NullTargetServerDispatchException.java
Specializations on FRDE so we can be more particular when we say there
was a problem.
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/RemoteProcedureDispatcher.java
Change addOperationToNode so we throw exceptions that give more detail
on issue rather than a mysterious true/false
M hbase-protocol-shaded/src/main/protobuf/MasterProcedure.proto
Undo SERVER_CRASH_HANDLE_RIT2. Bad idea (from HBASE-20173)
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
Have expireServer return true if it actually queued an expiration. Used
later in this patch.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
Hide methods that shouldn't be public. Add a particular check used out
in unassign procedure failure processing.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/MoveRegionProcedure.java
Check that server we're to move from is actually online (might
catch a few silly move requests early).
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java
Add doc on ServerState. Wasn't being used really. Now we actually stamp
a Server OFFLINE after its WAL has been split. Means it's safe to assign
since all WALs have been processed. Add methods to update SPLITTING
and to set it to OFFLINE after splitting done.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java
Change logging to be new-style and less repetitive of info.
Cater to new way in which .addOperationToNode returns info (exceptions
rather than true/false).
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java
Add looking for the case where we failed assign AND we should not
suspend because we will never be woken up because SCP is beyond
doing this for all stuck RPCs.
Some cleanup of the failure processing grouping where we can proceed.
TODOs have been handled in this refactor including the TODO that
wonders if it is possible that there are concurrent fails coming in
(Yes).
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
Doc and removing the old HBASE-20173 'fix'.
Also updating ServerStateNode post WAL splitting so it gets marked
OFFLINE.
A hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestServerCrashProcedureStuck.java
Nice test by Duo Zhang.
Signed-off-by: Umesh Agashe <uagashe@cloudera.com>
Signed-off-by: Duo Zhang <palomino219@gmail.com>
Signed-off-by: Mike Drob <mdrob@apache.org>
Add backoff when stuck in RegionTransitionProcedure, the superclass of
AssignProcedure and UnassignProcedure. Can happen when we go to
transition but the current Region state is not what we expect.
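A sketch of how an 'attempt' counter can drive a backoff follows. The
commit does not say which curve is used; capped exponential doubling is
an assumption here, and StuckBackoff and backoffMillis are hypothetical
names.

```java
final class StuckBackoff {
  /**
   * Delay before the next retry: baseMillis doubled once per attempt,
   * capped at maxMillis. Expects positive baseMillis and maxMillis.
   */
  static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
    int shift = Math.min(Math.max(attempt, 0), 30);
    // Compare before shifting so the multiplication cannot overflow.
    if (baseMillis > (maxMillis >> shift)) {
      return maxMillis;
    }
    return baseMillis << shift;
  }
}
```

With a 1000 ms base and a 60000 ms cap, attempts 0, 1, 2, 3 wait 1000,
2000, 4000 and 8000 ms, and later attempts settle at the cap.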
M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/Procedure.java
Add doc on being able to suspend and wait on a timeout.
M hbase-protocol-shaded/src/main/protobuf/MasterProcedure.proto
Add 'attempt' counter so we can do backoff when we get stuck.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignProcedure.java
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java
Add persistence of new 'attempt' counter
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java
Doc data members that are persisted by subclasses given this is 'odd'.
Add a counter for 'attempts' used when 'stuck' to implement backoff.
Add suspend with timeout when 'stuck'. Add callback when timeout is
exhausted which does wakeup of this procedure.
A hbase-server/src/test/java/org/apache/hadoop/hbase/master/assignment/TestUnexpectedStateException.java
Test of backoff.
rollWriter() fails after creating the file and returns false. In the next iteration of the while loop in recoverLease(), the file list is refreshed.
Signed-off-by: Appy <appy@cloudera.com>
Remove commons-cli and commons-collections4 use. Account
for the newer internal protobuf version of 3.5.1.
Signed-off-by: Michael Stack <stack@apache.org>
Signed-off-by: Mike Drob <mdrob@apache.org>
Changes for HBASE-20027 seem to cause UI not showing up on default port in standalone mode. For concurrent
unit test execution, individual tests can set hbase.localcluster.assign.random.ports to true or modify
test/resources/hbase-site.xml.
Adds a prepare step to RecoverMetaProcedure in which we test for
cluster up and master being up. If not up, we fail the run.
Modified hbase-server/src/main/java/org/apache/hadoop/hbase/master/cleaner/HFileCleaner.java
Modified hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/ChunkCreator.java
Minor log cleanup.
Modified hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RecoverMetaProcedure.java
Add prepare step.
Modified hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManagerMetrics.java
Debug for the failing test....
Added hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestRecoverMetaProcedure.java
Test the prepare step goes down if master or cluster are down.
in-memory compactions)
Log less. Log using same format as used elsewhere in log.
Align logs in HFileArchiver with how we format elsewhere. Removed
redundant 'region' qualifiers, tried to tighten up the emissions so
easier to read the long lines.
M hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/ChunkCreator.java
Add a label for each of the chunkcreators we make (I was confused by
two chunk creator stats emissions in log file -- didn't know that one
was for data and the other index).
M hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/CompactSplit.java
Formatting. Log less.
M hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MemStoreCompactionStrategy.java
Make the emissions in here trace-level. When more than a few regions,
log is filled with this stuff.
We assumed that we can run for loop from 0 to lastStep sequentially. MergeTableRegionProcedure skips step 2. So, when i is 0 the procedure is already at step 3.
Added a method StateMachineProcedure#getCurrentStateId that can be used from test code only.
* rely on git plumbing commands when checking if we've built the site for a particular commit already
* switch to forcing '-e' for bash
* add command line switches for: path to hbase, working directory, and publishing
* only export JAVA/MAVEN HOME if they aren't already set.
* add some docs about assumptions
* Update javadoc plugin to consistently be version 3.0.0
* avoid duplicative site invocations on reactor modules
* update use of cp command so it works both on linux and mac
* manually skip enforcer plugin during build
* still doing install of all jars due to MJAVADOC-490, but then skip rebuilding during aggregate reports.
* avoid the pager on git-diff by teeing to a log file, which also helps later reviewing in the case of big changesets.
Signed-off-by: Michael Stack <stack@apache.org>
Signed-off-by: Misty Stanley-Jones <misty@apache.org>
Conflicts:
hbase-backup/pom.xml
hbase-spark-it/pom.xml
Fix MoveRandomRegionOfTableAction. It depended on old AM behavior.
Make it do explicit move as is required in AMv3; w/o it, it was just
closing region causing test to fail.
Fix pom so hadoop3 profile specifies a different netty3 version.
Bunch of logging format changes that came of trying to read
the spew from this test.
Rename the PE Worker threads.
Send an interrupt if worker taking a long time to go down
(it may be RPC'ing out to a dead server, retrying so
interrupt). Also join on the ProcedureExecutor shutting down.
This will make problems shutting down more obvious.
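The shutdown sequence described above (wait, interrupt a slow worker,
then join) can be sketched as follows. WorkerShutdown and stop are
hypothetical names and the timeouts are parameters chosen by the
caller; this is not the code from the patch.

```java
final class WorkerShutdown {
  /**
   * Waits up to graceMillis for the worker to exit on its own, then
   * interrupts it and waits up to joinMillis more. Both timeouts must be
   * positive, since Thread.join(0) waits forever.
   * Returns true if the worker is down when we return.
   */
  static boolean stop(Thread worker, long graceMillis, long joinMillis)
      throws InterruptedException {
    worker.join(graceMillis);
    if (worker.isAlive()) {
      // The worker may be blocked, e.g. retrying an RPC against a dead server.
      worker.interrupt();
      worker.join(joinMillis);
    }
    return !worker.isAlive();
  }
}
```

A false return is the signal worth logging: the worker ignored the
interrupt and is still running.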
Disable TestRegionsOnMasterOptions. Master carrying Regions is broke.
Set the ProcedureExecutor worker threads as daemon.
Ditto for the timeout thread.
Remove hack from TestRegionsOnMasterOptions that was
put in place because the test would not go down.
Changes:
- replaced commons-logging with slf4j everywhere
- log.XXX(Throwable) calls were replaced with log.XXX(t.toString(), t)
- log.XXX(Object) calls were replaced with log.XXX(Objects.toString(obj))
- log.fatal() calls were replaced with log.error(HBaseMarkers.FATAL, ...)
- programmatic log4j configuration was removed from the unit test
This commit does not affect the current logging configurations, because log4j
is still on the classpath. slf4j-log4j12 binds log4j to slf4j.
Signed-off-by: Michael Stack <stack@apache.org>
Added new bulk assign createRoundRobinAssignProcedure to complement
the existing createAssignProcedure. The former asks the balancer for
target servers to set into the created AssignProcedures. The latter
sets no target server into AssignProcedure. When no target server
is specified, we make an effort at assign-time to deploy the
region to its old location if there was one.
The new round robin assign procedure creator does not do this. Use
the new round robin method on table create or reenabling offline
regions. Use the old assign in ServerCrashProcedure or in
EnableTable so there is a chance we retain locality.
Bulk preassigning, passing all to-be-assigned regions to the balancer in
one go, is good for ensuring good distribution, especially when read
replicas are in the mix.
The old assign was single-assign scoped so region replicas could
end up on the same server.
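To see why handing everything over in one go helps distribution, here is
a toy round-robin planner. It is only a sketch: the real balancer
considers much more, and RoundRobinPlan, the region names and the server
names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RoundRobinPlan {
  /** Spreads regions over servers in turn; expects distinct server names. */
  static Map<String, List<String>> assign(List<String> regions, List<String> servers) {
    if (servers.isEmpty()) {
      throw new IllegalArgumentException("no servers to assign to");
    }
    Map<String, List<String>> plan = new LinkedHashMap<>();
    for (String server : servers) {
      plan.put(server, new ArrayList<>());
    }
    for (int i = 0; i < regions.size(); i++) {
      // Seeing the whole batch lets consecutive regions land on different servers.
      String server = servers.get(i % servers.size());
      plan.get(server).add(regions.get(i));
    }
    return plan;
  }
}
```

If a region and its replica sit next to each other in the batch, they go
to different servers whenever there is more than one server, which a
one-region-at-a-time assign cannot guarantee.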
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignProcedure.java
Cleanup around forceNewPlan. Was confusing.
Added a Comparator to sort AssignProcedures so meta and system tables
come ahead of user-space tables.
M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
Remove the forceNewPlan argument on createAssignProcedure. Didn't make
sense given we were creating a new AssignProcedure; the arg had no
effect.
(createRoundRobinAssignProcedures) Recast to feed all regions to the balancer in
bulk and to sort the return so meta and system tables take precedence.
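The ordering (meta first, then system tables, then user-space tables)
can be sketched with a comparator over table names. This is not the
Comparator added to AssignProcedure: TablePriority is a hypothetical
name, and it assumes system tables are the ones in the 'hbase'
namespace.

```java
import java.util.Comparator;

final class TablePriority {
  /** Lower rank sorts first: meta, then other system tables, then user tables. */
  static int rank(String tableName) {
    if (tableName.equals("hbase:meta")) {
      return 0;
    }
    if (tableName.startsWith("hbase:")) {
      return 1; // assumption: the 'hbase' namespace holds the system tables
    }
    return 2;
  }

  static final Comparator<String> META_AND_SYSTEM_FIRST = (a, b) -> {
    int byRank = Integer.compare(rank(a), rank(b));
    // Within a rank, fall back to name order so the sort is deterministic.
    return byRank != 0 ? byRank : a.compareTo(b);
  };
}
```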
Miscellaneous fixes including keeping the Master around until all
RegionServers are down, documentation on how assignment retention
works, etc.
- Adds debug logging for future ease
- Removes 60s timeout since testRecoveryAndDoubleExecutionPreserveSplits is only halfway after a minute.
- Adds some comments
- Logging change: Some places report "regionState=" while others just "state=".
State machine procs also have "state=" in their logs. Let me change all region related logging to "regionState=" so that
1) it's consistent everywhere, 2) more filtered results when searching through logs.
- Adding javadoc comments
- Bug: ServerStateNode#regions is HashSet but there's no synchronization to prevent concurrent addRegion/removeRegion. Let's use concurrent set instead.
- Use getRegionsInTransitionCount() directly instead of getRegionsInTransition().size() because the latter copies everything into a new array - what a waste for just the size.
- There's mixed use of getRegionNode and getRegionStateNode for same return type - RegionStateNode. Changing everything to getRegionStateNode. Similarly rename other *RegionNode() fns to *RegionStateNode().
- RegionStateNode#transitionState() return value is useless since it always returns its first param.
- Other minor improvements
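For the HashSet bug in the list above, one standard fix is a set backed
by ConcurrentHashMap. The sketch below uses a hypothetical ServerRegions
class with String region names; it is not the actual ServerStateNode
code. It also exposes a count directly rather than copying the set just
to take its size.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class ServerRegions {
  // Thread-safe set: concurrent add/remove need no external locking.
  private final Set<String> regions = ConcurrentHashMap.newKeySet();

  void addRegion(String region) {
    regions.add(region);
  }

  void removeRegion(String region) {
    regions.remove(region);
  }

  /** Count without copying the contents into a new collection. */
  int regionCount() {
    return regions.size();
  }
}
```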
Updated HTrace version to 4.2
Created TraceUtil class to wrap htrace methods. Uses try with resources.
Signed-off-by: Balazs Meszaros <balazs.meszaros@cloudera.com>
Signed-off-by: Michael Stack <stack@apache.org>
* pull things that don't rely on HDFS in hbase-server/FSUtils into hbase-common/CommonFSUtils
* refactor setStoragePolicy so that it can move into hbase-common/CommonFSUtils, as a side effect update it for Hadoop 2.8,3.0+
* refactor WALProcedureStore so that it handles its own FS interactions
* add a reflection-based lookup of stream capabilities
* call said lookup in places where we make WALs to make sure hflush/hsync is available.
* javadoc / checkstyle cleanup on changes as flagged by yetus
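The reflection-based capability lookup can be sketched as below. It
assumes the stream exposes a public hasCapability(String) method on
newer Hadoop versions and nothing equivalent on older ones;
CapabilityProbe and FakeStream are hypothetical names, with FakeStream
standing in for a real output stream so the example runs on its own.

```java
import java.lang.reflect.Method;

final class CapabilityProbe {
  // Stand-in for a stream class that advertises its capabilities.
  public static class FakeStream {
    public boolean hasCapability(String capability) {
      return "hflush".equals(capability);
    }
  }

  /** True only if the stream has hasCapability(String) and it returns true. */
  static boolean hasCapability(Object stream, String capability) {
    try {
      Method m = stream.getClass().getMethod("hasCapability", String.class);
      return Boolean.TRUE.equals(m.invoke(stream, capability));
    } catch (ReflectiveOperationException e) {
      // Method missing (older library) or not callable: report "not supported".
      return false;
    }
  }
}
```

Treating "method missing" as "not supported" is the conservative
choice; a caller could instead treat it as "unknown" and proceed.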
Signed-off-by: Chia-Ping Tsai <chia7712@gmail.com>
Includes partial backport of hbase-build-configuration module
Signed-off-by: Michael Stack <stack@apache.org>
Signed-off-by: Misty Stanley-Jones <misty@apache.org>
Main changes:
- ProcedureInfo and LockInfo were removed, we use JSON instead of them
- Procedure and LockedResource are their server side equivalent
- Procedure protobuf state_data became obsolete, it is only kept for
reading previously written WAL
- Procedure protobuf contains a state_message field, which stores the internal
state messages (Any type instead of bytes)
- Procedure.serializeStateData and deserializeStateData were changed slightly
- Procedures internal states are available on client side
- Procedures are displayed on web UI and in shell in the following jruby format:
{ ID => '1', PARENT_ID => '-1', PARAMETERS => [ ..extra state information.. ] }
Signed-off-by: Michael Stack <stack@apache.org>
Upgrade commons-collections:3.2.2 to commons-collections4:4.1
Add missing dependency for hbase-procedure, hbase-thrift
Replace CircularFifoBuffer with CircularFifoQueue in WALProcedureStore and TaskMonitor
Signed-off-by: Sean Busbey <busbey@apache.org>
Signed-off-by: Chia-Ping Tsai <chia7712@gmail.com>
(cherry picked from commit 137b105c67)