hbase

Author	SHA1	Message	Date
zhangduo	a472f24d17	HBASE-20634 Reopen region while server crash can cause the procedure to be stuck A reattempt at fixing HBASE-20173 [AMv2] DisableTableProcedure concurrent to ServerCrashProcedure can deadlock The scenario is a SCP after processing WALs, goes to assign regions that were on the crashed server but a concurrent Procedure gets in there first and tries to unassign a region that was on the crashed server (could be part of a move procedure or a disable table, etc.). The unassign happens to run AFTER SCP has released all RPCs that were going against the crashed server. The unassign fails because the server is crashed. The unassign used to suspend itself only it would never be woken up because the server it was going against had already been processed. Worse, the SCP could not make progress because the unassign was suspended with the lock on a region that it wanted to assign held making it so it could make no progress. In here, we add to the unassign recognition of the state where it is running post SCP cleanup of RPCs. If present, unassign moves to finish instead of suspending itself. Includes a nice unit test made by Duo Zhang that reproduces nicely the hung scenario. M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/FailedRemoteDispatchException.java Moved this class back to hbase-procedure where it belongs. M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NoNodeDispatchException.java M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NoServerDispatchException.java M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/NullTargetServerDispatchException.java Specializations on FRDE so we can be more particular when we say there was a problem. M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/RemoteProcedureDispatcher.java Change addOperationToNode so we throw exceptions that give more detail on issue rather than a mysterious true/false M hbase-protocol-shaded/src/main/protobuf/MasterProcedure.proto Undo SERVER_CRASH_HANDLE_RIT2. Bad idea (from HBASE-20173) M hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java Have expireServer return true if it actually queued an expiration. Used later in this patch. M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java Hide methods that shouldn't be public. Add a particular check used out in unassign procedure failure processing. M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/MoveRegionProcedure.java Check that server we're to move from is actually online (might catch a few silly move requests early). M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java Add doc on ServerState. Wasn't being used really. Now we actually stamp a Server OFFLINE after its WAL has been split. Means it's safe to assign since all WALs have been processed. Add methods to update SPLITTING and to set it to OFFLINE after splitting done. M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java Change logging to be new-style and less repetitive of info. Cater to new way in which .addOperationToNode returns info (exceptions rather than true/false). M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java Add looking for the case where we failed assign AND we should not suspend because we will never be woken up because SCP is beyond doing this for all stuck RPCs. Some cleanup of the failure processing grouping where we can proceed. TODOs have been handled in this refactor including the TODO that wonders if it is possible that there are concurrent fails coming in (Yes). M hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java Doc and removing the old HBASE-20173 'fix'. Also updating ServerStateNode post WAL splitting so it gets marked OFFLINE. A hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestServerCrashProcedureStuck.java Nice test by Duo Zhang. Signed-off-by: Umesh Agashe <uagashe@cloudera.com> Signed-off-by: Duo Zhang <palomino219@gmail.com> Signed-off-by: Mike Drob <mdrob@apache.org>	2018-06-04 09:26:56 -07:00
zhangduo	997747076d	HBASE-20659 Implement a reopen table regions procedure	2018-05-30 20:03:25 +08:00
Sean Busbey	8ba2a7eeb9	HBASE-20544 Make HBTU default to random ports. Signed-off-by: Umesh Agashe <uagashe@cloudera.com> Signed-off-by: Josh Elser <elserj@apache.org>	2018-05-09 23:35:20 -07:00
Chia-Ping Tsai	4cb444e77b	HBASE-20169 NPE when calling HBTU.shutdownMiniCluster (TestAssignmentManagerMetrics is flakey); AMENDMENT	2018-05-02 16:14:58 -07:00
Michael Stack	5a071dbe2b	HBASE-20492 UnassignProcedure is stuck in retry loop on region stuck in OPENING state Add backoff when stuck in RegionTransitionProcedure, the subclass of AssignProcedure and UnassignProcedure. Can happen when we go to transition but the current Region state is not what we expect. M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/Procedure.java Add doc on being able to suspend and wait on a timeout. M hbase-protocol-shaded/src/main/protobuf/MasterProcedure.proto Add 'attempt' counter so we can do backoff when we get stuck. M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignProcedure.java M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java Add persistence of new 'attempt' counter M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java Doc data members that are persisted by subclasses given this is 'odd'. Add a counter for 'attempts' used when 'stuck' to implement backoff. Add suspend with timeout when 'stuck'. Add callback when timeout is exhausted which does wakeup of this procedure. A hbase-server/src/test/java/org/apache/hadoop/hbase/master/assignment/TestUnexpectedStateException.java Test of backoff.	2018-04-30 20:40:22 -07:00
Wei-Chiu Chuang	17a29ac231	HBASE-20338 WALProcedureStore#recoverLease() should have fixed sleeps for retrying rollWriter() Signed-off-by: Mike Drob <mdrob@apache.org> Signed-off-by: Umesh Agashe <uagashe@cloudera.com> Signed-off-by: Chia-Ping Tsai <chia7712@gmail.com>	2018-04-12 16:33:55 -05:00
Umesh Agashe	e4b51bb27d	HBASE-20330 ProcedureExecutor.start() gets stuck in recover lease on store rollWriter() fails after creating the file and returns false. In next iteration of while loop in recoverLease() file list is refreshed. Signed-off-by: Appy <appy@cloudera.com>	2018-04-11 16:07:42 -07:00
Josh Elser	15c398f7d2	HBASE-20223 Update to hbase-thirdparty 2.1.0 Remove commons-cli and commons-collections4 use. Account for the newer internal protobuf version of 3.5.1. Signed-off-by: Michael Stack <stack@apache.org> Signed-off-by: Mike Drob <mdrob@apache.org>	2018-03-26 22:05:19 -04:00
Umesh Agashe	c614b9f3e8	HBASE-20224 Web UI is broken in standalone mode Changes for HBASE-20027 seem to cause UI not showing up on default port in standalone mode. For concurrent unit test execution, individual tests can set hbase.localcluster.assign.random.ports to true or modify test/resources/hbase-site.xml.	2018-03-22 20:27:39 -07:00
Michael Stack	5d1b2110d1	Revert "HBASE-20224 Web UI is broken in standalone mode" Broke shell tests. This reverts commit dd9fe813ecc605f5e8b3c8598824f4e9a0a1eed6.	2018-03-22 10:57:42 -07:00
Umesh Agashe	4cb40e6d84	HBASE-20224 Web UI is broken in standalone mode Changes for HBASE-20027 seem to cause UI not showing up on default port in standalone mode. For concurrent unit test execution, individual tests can set hbase.localcluster.assign.random.ports to true or modify test/resources/hbase-site.xml.	2018-03-22 06:52:20 -07:00
Michael Stack	acbdb86bb4	HBASE-20169 NPE when calling HBTU.shutdownMiniCluster Adds a prepare step to RecoverMetaProcedure in which we test for cluster up and master being up. If not up, we fail the run. Modified hbase-server/src/main/java/org/apache/hadoop/hbase/master/cleaner/HFileCleaner.java Modified hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/ChunkCreator.java Minor log cleanup. Modified hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RecoverMetaProcedure.java Add prepare step. Modified hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManagerMetrics.java Debug for the failing test.... Added hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestRecoverMetaProcedure.java Test the prepare step goes down if master or cluster are down.	2018-03-20 13:11:57 -07:00
Michael Stack	bedf849d83	HBASE-20213 [LOGGING] Aligning formatting and logging less (compactions, in-memory compactions) Log less. Log using same format as used elsewhere in log. Align logs in HFileArchiver with how we format elsewhere. Removed redundant 'region' qualifiers, tried to tighten up the emissions so easier to read the long lines. M hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/ChunkCreator.java Add a label for each of the chunkcreators we make (I was confused by two chunk creator stats emissions in log file -- didn't know that one was for data and the other index). M hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/CompactSplit.java Formatting. Log less. M hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MemStoreCompactionStrategy.java Make the emissions in here trace-level. When more than a few regions, log is filled with this stuff.	2018-03-16 13:16:49 -07:00
Umesh Agashe	974200fca1	HBASE-20024 Fixed flakiness of TestMergeTableRegionsProcedure We assumed that we can run for loop from 0 to lastStep sequentially. MergeTableRegionProcedure skips step 2. So, when i is 0 the procedure is already at step 3. Added a method StateMachineProcedure#getCurrentStateId that can be used from test code only.	2018-03-09 12:45:39 -08:00
Michael Stack	b11e506664	HBASE-20069 fix existing findbugs errors in hbase-server	2018-02-26 16:01:31 -08:00
Michael Stack	549a6d93d4	HBASE-20043 ITBLL fails against hadoop3 Fix MoveRandomRegionOfTableAction. It depended on old AM behavior. Make it do explicit move as is required in AMv3; w/o it, it was just closing region causing test to fail. Fix pom so hadoop3 profile specifies a different netty3 version. Bunch of logging format change that came of trying to read the spew from this test.	2018-02-24 17:29:54 -08:00
Michael Stack	51cea3e2c3	HBASE-20024 TestMergeTableRegionsProcedure is STILL flakey	2018-02-20 11:08:27 -08:00
zhangduo	391790ddb0	HBASE-19978 The keepalive logic is incomplete in ProcedureExecutor	2018-02-19 17:13:47 -08:00
Michael Stack	0593dda663	HBASE-19951 Cleanup the explicit timeout value for test method	2018-02-10 09:24:31 -08:00
Michael Stack	06dec20582	HBASE-19919 Tidying up logging	2018-02-03 08:42:02 -08:00
zhangduo	918599ef12	HBASE-19873 Add a CategoryBasedTimeout ClassRule for all UTs	2018-01-29 08:43:56 +08:00
Thiruvel Thirumoolan	ce50830a0a	HBASE-19756 Master NPE during completed failed proc eviction Signed-off-by: Andrew Purtell <apurtell@apache.org>	2018-01-24 16:42:58 -08:00
Michael Stack	7fe4aa6fe4	HBASE-19828 Flakey TestRegionsOnMasterOptions.testRegionsOnAllServers Rename the PE Worker threads. Send an interrupt if worker taking a long time to go down (it may be RPC'ing out to a dead server, retrying so interrupt). Also join on the ProcedureExecutor shutting down. This will make problems shutting down more obvious. Disable TestRegionsOnMasterOptions. Master carrying Regions is broke.	2018-01-19 21:54:19 -08:00
Michael Stack	646770dd51	HBASE-19527 Make ExecutorService threads daemon=true Set the ProcedureExecutor worker threads as daemon. Ditto for the timeout thread. Remove hack from TestRegionsOnMasterOptions that was put in place because the test would not go down.	2018-01-18 11:30:15 -08:00
Peter Somogyi	c269e63a07	HBASE-19809 Fix findbugs and error-prone warnings in hbase-procedure (branch-2)	2018-01-17 11:23:38 -08:00
zhangduo	7f4bd0d371	HBASE-19524 Master side changes for moving peer modification from zk watcher to procedure	2018-01-09 13:11:01 +08:00
zhangduo	f17198ff19	HBASE-19216 Implement a general framework to execute remote procedure on RS	2018-01-09 13:11:01 +08:00
Mike Drob	c3b4f788b1	HBASE-19552 find-and-replace thirdparty offset	2017-12-28 11:52:32 -06:00
Chia-Ping Tsai	01b1f48ccd	HBASE-19644 add the checkstyle rule to reject the illegal imports	2017-12-28 04:10:42 +08:00
Peter Somogyi	35728acd21	HBASE-19578 MasterProcWALs cleaning is incorrect Signed-off-by: tedyu <yuzhihong@gmail.com>	2017-12-21 09:38:25 -08:00
Balazs Meszaros	f572c4b80e	HBASE-10092 Move up on to log4j2 Changes: - replaced commons-logging to slf4j everywhere - log.XXX(Throwable) calls were replaced with log.XXX(t.toString(), t) - log.XXX(Object) calls were replaced with log.XXX(Objects.toString(obj)) - log.fatal() calls were replaced with log.error(HBaseMarkers.FATAL, ...) - programmatic log4j configuration was removed from the unit test This commit does not affect the current logging configurations, because log4j is still on the classpath. slf4j-log4j12 binds log4j to slf4j. Signed-off-by: Michael Stack <stack@apache.org>	2017-12-20 22:21:33 -08:00
Michael Stack	7f938dd980	HBASE-19218 Master stuck thinking hbase:namespace is assigned after restart preventing initialization Signed-off-by: Li Xiang <easyliangjob@gmail.com>	2017-12-20 21:47:10 -08:00
Guanghao Zhang	6c6a9d2d1c	HBASE-19563 A few hbase-procedure classes missing @InterfaceAudience annotation	2017-12-20 09:33:06 -08:00
Jan Hentschel	f46a6d1637	HBASE-19540 Reduced number of unnecessary semicolons	2017-12-19 20:06:59 +01:00
Michael Stack	010012cbcb	HBASE-18946 Stochastic load balancer assigns replica regions to the same RS Added new bulk assign createRoundRobinAssignProcedure to complement the existing createAssignProcedure. The former asks the balancer for target servers to set into the created AssignProcedures. The latter sets no target server into AssignProcedure. When no target server is specified, we make effort at assign-time at trying to deploy the region to its old location if there was one. The new round robin assign procedure creator does not do this. Use the new round robin method on table create or reenabling offline regions. Use the old assign in ServerCrashProcedure or in EnableTable so there is a chance we retain locality. Bulk preassigning passing all to-be-assigned to the balancer in one go is good for ensuring good distribution especially when read replicas in the mix. The old assign was single-assign scoped so region replicas could end up on the same server. M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignProcedure.java Cleanup around forceNewPlan. Was confusing. Added a Comparator to sort AssignProcedures so meta and system tables come ahead of user-space tables. M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java Remove the forceNewPlan argument on createAssignProcedure. Didn't make sense given we were creating a new AssignProcedure; the arg had no effect. (createRoundRobinAssignProcedures) Recast to feed all regions to the balancer in bulk and to sort the return so meta and system tables take precedence. Miscellaneous fixes including keeping the Master around until all RegionServers are down, documentation on how assignment retention works, etc.	2017-12-15 08:53:41 -08:00
Mike Drob	2c9ef8a471	HBASE-19289 Add flag to disable stream capability enforcement Signed-off-by: Josh Elser <elserj@apache.org>	2017-12-14 12:19:22 -06:00
Apekshit Sharma	7092b814bd	HBASE-19457 Debugging flaky TestTruncateTableProcedure - Adds debug logging for future ease - Removes 60s timeout since testRecoveryAndDoubleExecutionPreserveSplits is only halfway after a minute. - Adds some comments - Logging change: Some places report "regionState=" while others just "state=". State machine procs also have "state=" in their logs. Let me change all region related logging to "regionState=" so that 1) it's consistent everywhere, 2) more filtered results when searching through logs.	2017-12-08 17:25:16 -08:00
Apekshit Sharma	81b95afbee	HBASE-19367 Refactoring in RegionStates, and RSProcedureDispatcher - Adding javadoc comments - Bug: ServerStateNode#regions is HashSet but there's no synchronization to prevent concurrent addRegion/removeRegion. Let's use concurrent set instead. - Use getRegionsInTransitionCount() directly instead of getRegionsInTransition().size() because the latter copies everything into a new array - what a waste for just the size. - There's mixed use of getRegionNode and getRegionStateNode for same return type - RegionStateNode. Changing everything to getRegionStateNode. Similarly rename other RegionNode() fns to RegionStateNode(). - RegionStateNode#transitionState() return value is useless since it always returns its first param. - Other minor improvements	2017-11-29 22:40:11 -08:00
Apekshit Sharma	f886716617	HBASE-19319 Fix bug in synchronizing over ProcedureEvent Also moves event related functions (wake/wait/suspend) from ProcedureScheduler to ProcedureEvent class	2017-11-27 11:51:17 -08:00
Mike Drob	3a0f59d031	HBASE-18983 update error-prone to 2.1.1	2017-11-04 21:28:52 -05:00
Sean Busbey	e79a007dd9	HBASE-18784 if available, query underlying outputstream capabilities where we need hflush/hsync. * pull things that don't rely on HDFS in hbase-server/FSUtils into hbase-common/CommonFSUtils * refactor setStoragePolicy so that it can move into hbase-common/CommonFSUtils, as a side effect update it for Hadoop 2.8,3.0+ * refactor WALProcedureStore so that it handles its own FS interactions * add a reflection-based lookup of stream capabilities * call said lookup in places where we make WALs to make sure hflush/hsync is available. * javadoc / checkstyle cleanup on changes as flagged by yetus Signed-off-by: Chia-Ping Tsai <chia7712@gmail.com>	2017-11-02 21:29:20 -05:00
Sean Busbey	4b124913f0	HBASE-17823 Migrate to Apache Yetus Audience Annotations Signed-off-by: Michael Stack <stack@apache.org> Signed-off-by: Misty Stanley-Jones <misty@apache.org>	2017-09-12 20:53:30 -05:00
Balazs Meszaros	359fed7b4b	HBASE-18106 Redo ProcedureInfo and LockInfo Main changes: - ProcedureInfo and LockInfo were removed, we use JSON instead of them - Procedure and LockedResource are their server side equivalent - Procedure protobuf state_data became obsolete, it is only kept for reading previously written WAL - Procedure protobuf contains a state_message field, which stores the internal state messages (Any type instead of bytes) - Procedure.serializeStateData and deserializeStateData were changed slightly - Procedures internal states are available on client side - Procedures are displayed on web UI and in shell in the following jruby format: { ID => '1', PARENT_ID => '-1', PARAMETERS => [ ..extra state information.. ] } Signed-off-by: Michael Stack <stack@apache.org>	2017-09-08 10:24:04 -07:00
Peter Somogyi	137b105c67	HBASE-18704 Upgrade hbase to commons-collections 4 Upgrade commons-collections:3.2.2 to commons-collections4:4.1 Add missing dependency for hbase-procedure, hbase-thrift Replace CircularFifoBuffer with CircularFifoQueue in WALProcedureStore and TaskMonitor Signed-off-by: Sean Busbey <busbey@apache.org> Signed-off-by: Chia-Ping Tsai <chia7712@gmail.com>	2017-09-07 10:30:01 -05:00
Michael Stack	6f44b24860	HBASE-18551 [AMv2] UnassignProcedure and crashed regionservers If an unassign is unable to communicate with its target server, expire the server and then wait on a signal from ServerCrashProcedure before proceeding. The unassign has lock on the region so no one else can proceed till we complete. We prevent any subsequent assign from running until logs have been split for crashed server. In AssignProcedure, do not assign if table is DISABLING or DISABLED. M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java Change remoteCallFailed so it returns boolean on whether implementor wants to stay suspended or not. M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java Doc. Also, if we are unable to talk to remote server, expire it and then wait on SCP to wake us up after it has processed logs for failed server.	2017-08-11 07:16:33 -07:00
Michael Stack	e4ba404a5a	Revert "HBASE-18551 [AMv2] UnassignProcedure and crashed regionservers" This reverts commit 2dd75d10f8818ed31fcc36bd89024e9ad728ae41.	2017-08-10 14:59:52 -07:00
Michael Stack	2dd75d10f8	HBASE-18551 [AMv2] UnassignProcedure and crashed regionservers If an unassign is unable to communicate with its target server, expire the server and then wait on a signal from ServerCrashProcedure before proceeding. The unassign has lock on the region so no one else can proceed till we complete. We prevent any subsequent assign from running until logs have been split for crashed server. In AssignProcedure, do not assign if table is DISABLING or DISABLED. M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java Change remoteCallFailed so it returns boolean on whether implementor wants to stay suspended or not. M hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java Doc. Also, if we are unable to talk to remote server, expire it and then wait on SCP to wake us up after it has processed logs for failed server.	2017-08-10 14:53:35 -07:00
Umesh Agashe	a5db120e60	HBASE-18261 Created RecoverMetaProcedure and used it from ServerCrashProcedure and HMaster.finishActiveMasterInitialization(). This procedure can be used from any code before accessing meta, to initialize/ recover meta Signed-off-by: Michael Stack <stack@apache.org>	2017-07-31 14:25:03 -07:00
Balazs Meszaros	8f006582e3	HBASE-18367 Reduce ProcedureInfo usage Signed-off-by: Michael Stack <stack@apache.org>	2017-07-24 10:41:03 +01:00
Michael Stack	890d92a90c	HBASE-17908 Upgrade guava Pull in guava 22.0 by using the shaded version up in new hbase-thirdparty project. In poms, exclude guava everywhere except on hadoop-common. Do this so we minimize transitive includes. hadoop-common is needed because hadoop Configuration uses guava doing preconditions. Everywhere we used guava, instead use shaded so fix a load of imports. Stopwatch API changed as did hashing and toStringHelper which is now in MoreObjects class. Otherwise, minimal changes to come up on 22.0	2017-07-21 15:28:08 +01:00
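
The HBASE-20492 entry above describes adding an 'attempt' counter so that a stuck RegionTransitionProcedure can suspend with a timeout and back off between retries. The log does not say how the delay is derived from the counter, so the following is only a minimal sketch of one common choice, capped exponential backoff. The class name, method name, base delay, and cap are all hypothetical and are not taken from the HBase source.

```java
// Hypothetical sketch: none of these names or constants come from HBase.
final class BackoffSketch {
  private static final long BASE_MILLIS = 1_000L;  // assumed delay for the first retry
  private static final long MAX_MILLIS = 60_000L;  // assumed upper bound on any delay

  /** Delay before retry number `attempt` (0-based): BASE_MILLIS * 2^attempt, capped. */
  static long backoffMillis(int attempt) {
    if (attempt < 0) {
      throw new IllegalArgumentException("attempt must be >= 0");
    }
    // Clamp the shift so the left shift cannot overflow a long.
    int shift = Math.min(attempt, 16);
    return Math.min(BASE_MILLIS << shift, MAX_MILLIS);
  }

  public static void main(String[] args) {
    // Print the delay a procedure would wait before each of its first retries.
    for (int attempt = 0; attempt < 8; attempt++) {
      System.out.println("attempt=" + attempt + " backoffMillis=" + backoffMillis(attempt));
    }
  }
}
```

Persisting the counter with the procedure state, as the commit message says the subclasses do, would let the delay keep growing across a master restart rather than resetting to the base value.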


276 Commits