Commit Graph

155 Commits

Author SHA1 Message Date
Michael Stack a3c5a74487 Revert "HBASE-14614 Procedure v2 - Core Assignment Manager (Matteo Bertozzi)"
Revert a mistaken commit!!!

This reverts commit dc1065a85d.
2017-05-24 23:31:36 -07:00
Michael Stack dc1065a85d HBASE-14614 Procedure v2 - Core Assignment Manager (Matteo Bertozzi)
Move to a new AssignmentManager, one that describes Assignment using
a State Machine built on top of ProcedureV2 facility.

This doc. keeps state on where we are at w/ the new AM:
https://docs.google.com/document/d/1eVKa7FHdeoJ1-9o8yZcOTAQbv0u0bblBlCCzVSIn69g/edit#heading=h.vfdoxqut9lqn
Includes list of tests disabled by this patch with reasons why.

Based on patches from Matteos' repository and then fix up to get it all to pass cluster
tests, filling in some missing functionality, fix of findbugs, fixing bugs, etc..
including:

1. HBASE-14616 Procedure v2 - Replace the old AM with the new AM.
The basis comes from Matteo's repo here:
689227fcbf

Patch replaces old AM with the new under subpackage master.assignment.
Mostly just updating classes to use new AM -- import changes -- rather
than the old. It also removes old AM and supporting classes.
See below for more detail.

2. HBASE-14614 Procedure v2 - Core Assignment Manager (Matteo Bertozzi)
3622cba4e3

Adds running of remote procedure. Adds batching of remote calls.
Adds support for assign/unassign in procedures. Adds version info
reporting in rpc. Adds start of an AMv2.

3. Reporting of remote RS version is from here:
ddb4df3964.patch

4. And remote dispatch of procedures is from:
186b9e7c4d

5. The split merge patches from here are also melded in:
9a3a95a2c2
and d6289307a0

We add testing util for new AM and new sets of tests.

Does a bunch of fixup on logging so its possible to follow a procedures' narrative by grepping
procedure id. We spewed loads of log too on big transitions such as master fail; fixed.

Fix CatalogTracker. Make it use Procedures doing clean up of Region data on split/merge.
Without these changes, ITBLL was failing at larger scale (3-4hours 5B rows) because we were
splitting split Regions among other things (CJ would run but wasn't
taking lock on Regions so havoc).

Added a bunch of doc. on Procedure primitives.

Added new region-based state machine base class. Moved region-based
state machines on to it.

Found bugs in the way procedure locking was doing in a few of the
region-based Procedures. Having them all have same subclass helps here.

Added isSplittable and isMergeable to the Region Interface.

Master would split/merge even though the Regions still had
references. Fixed it so Master asks RegionServer if Region
is splittable.

Messing more w/ logging. Made all procedures log the same and report
the state the same; helps when logging is regular.

Rewrote TestCatalogTracker. Enabled TestMergeTableRegionProcedure.

Added more functionality to MockMasterServices so can use it doing
standalone testing of Procedures (made TestCatalogTracker use it
instead of its own version).

Add to MasterServices ability to wait on Master being up -- makes
it so can Mock Master and start to implement standalone split testing.
Start in on a Split region standalone test in TestAM.

Fix bug where a Split can fail because it comes in in the middle of
a Move (by holding lock for duration of a Move).

Breaks CPs that were watching merge/split. These are run by Master now
so you need to observe on Master, not on RegionServer.

Details:

M hbase-client/src/main/java/org/apache/hadoop/hbase/ClusterStatus.java
Takes List of regionstates on construction rather than a Set.
NOTE!!!!! This is a change in a public class.

M hbase-client/src/main/java/org/apache/hadoop/hbase/HRegionInfo.java
Add utility getShortNameToLog

M hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java
M hbase-client/src/main/java/org/apache/hadoop/hbase/client/ShortCircuitMasterConnection.java
Add support for dispatching assign, split and merge processes.

M hbase-client/src/main/java/org/apache/hadoop/hbase/master/RegionState.java
Purge old overlapping states: PENDING_OPEN, PENDING_CLOSE, etc.

M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/Procedure.java
Lots of doc on its inner workings. Bug fixes.

M hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureExecutor.java
Log and doc on workings. Bug fixes.

A hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/RemoteProcedureDispatcher.java
Dispatch remote procedures every 150ms or 32 items -- which ever
happens first (configurable). Runs a timeout thread. This facility is
not on yet; will come in as part of a later fix. Currently works a
region at a time. This class carries notion of a remote procedure and of a buffer full of these.
"hbase.procedure.remote.dispatcher.threadpool.size" with default = 128
"hbase.procedure.remote.dispatcher.delay.msec" with default = 150ms
"hbase.procedure.remote.dispatcher.max.queue.size" with default = 32

M hbase-client/src/main/java/org/apache/hadoop/hbase/shaded/protobuf/ProtobufUtil.java
Add in support for merge. Remove no-longer used methods.

M hbase-protocol-shaded/src/main/protobuf/Admin.proto b/hbase-protocol-shaded/src/main/protobuf/Admin.proto
Add execute procedures call ExecuteProcedures.

M hbase-protocol-shaded/src/main/protobuf/MasterProcedure.proto
Add assign and unassign state support for procedures.

M hbase-server/src/main/java/org/apache/hadoop/hbase/client/VersionInfoUtil.java
Adds getting RS version out of RPC
Examples: (1.3.4 is 0x0103004, 2.1.0 is 0x0201000)

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
Remove periodic metrics chore. This is done over in new AM now.
Replace AM with the new. Host the procedures executor.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterMetaBootstrap.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterMetaBootstrap.java
Have AMv2 handle assigning meta.

M hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterRpcServices.java
Extract version number of the server making rpc.

A hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignProcedure.java
Add new assign procedure. Runs assign via Procedure Dispatch.
There can only be one RegionTransitionProcedure per region running at the time,
since each procedure takes a lock on the region.

D hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignCallable.java
D hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
D hbase-server/src/main/java/org/apache/hadoop/hbase/master/BulkAssigner.java
D hbase-server/src/main/java/org/apache/hadoop/hbase/master/GeneralBulkAssigner.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/GeneralBulkAssigner.java
Remove these hacky classes that were never supposed to live longer than
a month or so to be replaced with real assigners.

D hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStateStore.java
D hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStates.java
D hbase-server/src/main/java/org/apache/hadoop/hbase/master/UnAssignCallable.java

A hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
A procedure-based AM (AMv2).

TODO
 - handle region migration
 - handle meta assignment first
 - handle sys table assignment first (e.g. acl, namespace)
 - handle table priorities
  "hbase.assignment.bootstrap.thread.pool.size"; default size is 16.
  "hbase.assignment.dispatch.wait.msec"; default wait is 150
  "hbase.assignment.dispatch.wait.queue.max.size"; wait max default is 100
  "hbase.assignment.rit.chore.interval.msec"; default is 5 * 1000;
  "hbase.assignment.maximum.attempts"; default is 10;

 A hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/MoveRegionProcedure.java
 Procedure that runs subprocedure to unassign and then assign to new location

 A hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStateStore.java
 Manage store of region state (in hbase:meta by default).

 A hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java
 In-memory state of all regions. Used by AMv2.

 A hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionTransitionProcedure.java
 Base RIT procedure for Assign and Unassign.

 A hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/UnassignProcedure.java
 Unassign procedure.

 A hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
 Run region assignement in a manner that pays attention to target server version.
 Adds "hbase.regionserver.rpc.startup.waittime"; defaults 60 seconds.
2017-05-24 20:47:25 -07:00
Josh Elser d671a1dbc6 HBASE-17955 Various reviewboard improvements to space quota work
Most notable change is to cache SpaceViolationPolicyEnforcement objects
in the write path. When a table has no quota or there is not SpaceQuotaSnapshot
for that table (yet), we want to avoid creating lots of
SpaceViolationPolicyEnforcement instances, caching one instance
instead. This will help reduce GC pressure.
2017-05-22 13:41:36 -04:00
Josh Elser 13af7f8ac6 HBASE-17002 JMX metrics and some UI additions for space quotas 2017-05-22 13:41:35 -04:00
ckulkarni eb6ded4849 HBASE-17448 Export metrics from RecoverableZooKeeper
Added metrics for RecoverableZooKeeper related to specific exceptions,
total failed ZooKeeper API calls and latency histograms for read,
write and sync operations. Also added unit tests for the same. Added
service provider for the ZooKeeper metrics implementation inside the
hadoop compatibility module.

Signed-off-by: Andrew Purtell <apurtell@apache.org>
2017-04-26 18:30:13 -07:00
Umesh Agashe c8461456d0 HBASE-17888: Added generic methods for updating metrics on submit and finish of a procedure execution
Signed-off-by: Michael Stack <stack@apache.org>
2017-04-14 11:51:08 -07:00
Jan Hentschel b53f354763 HBASE-17532 Replaced explicit type with diamond operator
Signed-off-by: Michael Stack <stack@apache.org>
2017-03-07 11:22:51 -08:00
Ashu Pachauri d2c083d21c HBASE-17627 Active workers metric for thrift (Ashu Pachauri)
Signed-off-by: Gary Helmling <garyh@apache.org>
2017-02-15 14:50:19 -08:00
Gary Helmling 7a6518f05d HBASE-17578 Thrift metrics should handle exceptions 2017-02-06 11:39:49 -08:00
Jerry He bc168b419d HBASE-17581 mvn clean test -PskipXXXTests does not work properly for some modules (Yi Liang) 2017-02-02 11:05:17 -08:00
Michael Stack fb2c89b1b3 HBASE-16785 We are not running all tests
M TestStressWALProcedureStore.java
Disable test that now runs that fails because of difference in pb3.1.0.

Signed-off-by: Michael Stack <stack@apache.org>
2017-01-26 21:49:18 -08:00
Michael Stack 1ee8776273 HBASE-16785 We are not running all tests Do nothing patch just to get baseline of how many tests we run 2017-01-26 12:21:00 -08:00
Enis Soztutar c64a1d1994 HBASE-9774 HBase native metrics and metric collection for coprocessors 2017-01-25 11:47:35 -08:00
Guanghao Zhang b3d8d06703 HBASE-17205 Add a metric for the duration of region in transition 2016-12-01 10:13:37 -08:00
Guanghao Zhang cc03f7ad53 HBASE-16561 Add metrics about read/write/scan queue length and active read/write/scan handler count
Signed-off-by: zhangduo <zhangduo@apache.org>
2016-11-29 09:43:30 +08:00
Enis Soztutar 03bc884ea0 HBASE-17017 Remove the current per-region latency histogram metrics 2016-11-08 18:31:53 -08:00
Dustin Pho fcef2c02c9 HBASE-16661 Add last major compaction age to per-region metrics
Signed-off-by: Gary Helmling <garyh@apache.org>
2016-10-10 15:16:12 -07:00
Sean Busbey 76396714e1 HBASE-15984 Handle premature EOF treatment of WALs in replication.
In some particular deployments, the Replication code believes it has
reached EOF for a WAL prior to succesfully parsing all bytes known to
exist in a cleanly closed file.

Consistently this failure happens due to an InvalidProtobufException
after some number of seeks during our attempts to tail the in-progress
RegionServer WAL. As a work-around, this patch treats cleanly closed
files differently than other execution paths. If an EOF is detected due
to parsing or other errors while there are still unparsed bytes before
the end-of-file trailer, we now reset the WAL to the very beginning and
attempt a clean read-through.

In current testing, a single such reset is sufficient to work around
observed dataloss. However, the above change will retry a given WAL file
indefinitely. On each such attempt, a log message like the below will
be emitted at the WARN level:

  Processing end of WAL file '{}'. At position {}, which is too far away
  from reported file length {}. Restarting WAL reading (see HBASE-15983
  for details).

Additionally, this patch adds some additional log detail at the TRACE
level about file offsets seen while handling recoverable errors. It also
add metrics that measure the use of this recovery mechanism.
2016-09-29 10:07:14 -05:00
Enis Soztutar eb112783ae HBASE-16604 Scanner retries on IOException can cause the scans to miss data - RECOMMIT after revert 2016-09-23 11:27:13 -07:00
Enis Soztutar 39db0cac78 Revert "HBASE-16604 Scanner retries on IOException can cause the scans to miss data"
This reverts commit 83cf44cd3f.

Reverting because accidental files are committed with this.
2016-09-23 11:25:23 -07:00
Enis Soztutar 83cf44cd3f HBASE-16604 Scanner retries on IOException can cause the scans to miss data 2016-09-22 12:06:11 -07:00
anoopsamjohn 1384c9a08d HBASE-16650 Wrong usage of BlockCache eviction stat for heap memory tuning. 2016-09-22 21:28:30 +05:30
Ashish Singhi 31f16d6aec HBASE-16471 Region Server metrics context will be wrong when machine hostname contain "master" word (Pankaj Kumar) 2016-08-24 18:59:44 +05:30
Geoffrey cb02be38ab HBASE-16448 Custom metrics for custom replication endpoints
Signed-off-by: Andrew Purtell <apurtell@apache.org>
2016-08-23 17:17:08 -07:00
stack 69d170063f HBASE-16181 Fix AssignmentManager MBean name (Reid Chan) 2016-07-29 16:40:39 -07:00
Reid Chan abfd584fe6 HBASE-14743 Add metrics around HeapMemoryManager. (Reid Chan)
Change-Id: I7305f7b7034b216930b5fb5c57de9ba5eabf96d8

Signed-off-by: Apekshit Sharma <appy@apache.org>
2016-07-25 14:32:38 -07:00
Apekshit Sharma eff38ccf8c Revert HBASE-14743 because of wrong attribution. Since I added commit message to the raw patch, it's making me as author instead of Reid. I should have used --author flag to set Reid as author.
This reverts commit 064271da16.
2016-07-25 14:32:38 -07:00
Apekshit Sharma 064271da16 HBASE-14743 Add metrics around HeapMemoryManager. (Reid Chan)
Change-Id: I60b2435355b3e605e7d91cbf5aca5d2988f26f33
2016-07-25 13:45:50 -07:00
Jingcheng Du 518faa735b HBASE-15353 Add metric for number of CallQueueTooBigException's 2016-06-24 13:00:06 +08:00
Gary Helmling f4cec2e202 HBASE-16085 Add a metric for failed compactions 2016-06-23 15:38:58 -07:00
stack 3a95552cfe HBASE-15948 Port "HADOOP-9956 RPC listener inefficiently assigns connections to readers" Adds HADOOP-9955 RPC idle connection closing is extremely inefficient
Changes how we do accounting of Connections to match how it is done in Hadoop.
Adds a ConnectionManager class. Adds new configurations for this new class.

"hbase.ipc.client.idlethreshold" 4000
"hbase.ipc.client.connection.idle-scan-interval.ms" 10000
"hbase.ipc.client.connection.maxidletime" 10000
"hbase.ipc.client.kill.max", 10
"hbase.ipc.server.handler.queue.size", 100

The new scheme does away with synchronization that purportedly would freeze out
reads while we were cleaning up stale connections (according to HADOOP-9955)

Also adds in new mechanism for accepting Connections by pulling in as many
as we can at a time adding them to a Queue instead of doing one at a time.
Can help when bursty traffic according to HADOOP-9956. Removes a blocking
while Reader is busy parsing a request. Adds configuration
"hbase.ipc.server.read.connection-queue.size" with default of 100 for
queue size.

Signed-off-by: stack <stack@apache.org>
2016-06-07 16:42:21 -07:00
stack e66ecd7db6 Revert "HBASE-15948 Port "HADOOP-9956 RPC listener inefficiently assigns connections to readers""
Revert mistaken commit...
This reverts commit e0b70c00e7.
2016-06-07 16:41:30 -07:00
stack 6d5a25935e Revert "HBASE-15967 Metric for active ipc Readers and make default fraction of cpu count"
Revert mistaken commit
This reverts commit 1125215aad.
2016-06-07 16:41:01 -07:00
stack 1125215aad HBASE-15967 Metric for active ipc Readers and make default fraction of cpu count
Add new metric hbase.regionserver.ipc.runningReaders
Also make it so Reader count is a factor of processor count
2016-06-07 13:10:14 -07:00
stack e0b70c00e7 HBASE-15948 Port "HADOOP-9956 RPC listener inefficiently assigns connections to readers"
Adds HADOOP-9955 RPC idle connection closing is extremely inefficient
Then removes queue added by HADOOP-9956 at Enis suggestion

    Changes how we do accounting of Connections to match how it is done in Hadoop.
    Adds a ConnectionManager class. Adds new configurations for this new class.

    "hbase.ipc.client.idlethreshold" 4000
    "hbase.ipc.client.connection.idle-scan-interval.ms" 10000
    "hbase.ipc.client.connection.maxidletime" 10000
    "hbase.ipc.client.kill.max", 10
    "hbase.ipc.server.handler.queue.size", 100

    The new scheme does away with synchronization that purportedly would freeze out
    reads while we were cleaning up stale connections (according to HADOOP-9955)

    Also adds in new mechanism for accepting Connections by pulling in as many
    as we can at a time adding them to a Queue instead of doing one at a time.
    Can help when bursty traffic according to HADOOP-9956. Removes a blocking
    while Reader is busy parsing a request. Adds configuration
    "hbase.ipc.server.read.connection-queue.size" with default of 100 for
    queue size.
2016-06-07 13:10:14 -07:00
Enis Soztutar b75b226804 HBASE-15740 Replication source.shippedKBs metric is undercounting because it is in KB 2016-05-09 10:25:49 -07:00
Alex Moundalexis 0bf065a5d5 HBASE-15768 fix capitalization of ZooKeeper usage
Signed-off-by: Sean Busbey <busbey@apache.org>
2016-05-05 15:35:44 -05:00
Enis Soztutar 4c0587134a HBASE-15671 Add per-table metrics on memstore, storefile and regionsize (Alicia Ying Shu) 2016-04-21 13:33:26 -07:00
Andrew Purtell b6617b4eb9 HBASE-15663 Hook up JvmPauseMonitor to ThriftServer 2016-04-20 17:37:35 -07:00
Andrew Purtell a330a2b505 HBASE-15662 Hook up JvmPauseMonitor to REST server 2016-04-20 17:37:35 -07:00
Andrew Purtell 2c26fe37ac HBASE-15614 Report metrics from JvmPauseMonitor 2016-04-20 17:37:34 -07:00
Enis Soztutar 18d70bc680 HBASE-15518 Add Per-Table metrics back (Alicia Ying Shu) 2016-04-20 14:35:45 -07:00
tedyu 8541fe4ad1 HBASE-15093 Replication can report incorrect size of log queue for the global source when multiwal is enabled (Ashu Pachauri) 2016-04-11 08:17:20 -07:00
Elliott Clark a71ce6e738 HBASE-14983 Create metrics for per block type hit/miss ratios
Summary: Missing a root index block is worse than missing a data block. We should know the difference

Test Plan: Tested on a local instance. All numbers looked reasonable.

Differential Revision: https://reviews.facebook.net/D55563
2016-03-30 11:41:11 -07:00
Enis Soztutar b3fe4ed16c HBASE-15412 Add average region size metric (Alicia Ying Shu) 2016-03-22 14:46:27 -07:00
Enis Soztutar 797562e6c3 HBASE-15464 Flush / Compaction metrics revisited 2016-03-21 17:50:02 -07:00
Enis Soztutar 51259fe4a5 HBASE-15377 Per-RS Get metric is time based, per-region metric is size-based (Heng Chen) 2016-03-15 11:22:18 -07:00
Enis Soztutar a979d85582 HBASE-15435 Add WAL (in bytes) written metric (Alicia Ying Shu) 2016-03-10 20:16:30 -08:00
chenheng f30afa05d9 HBASE-15376 ScanNext metric is size-based while every other per-operation metric is time based 2016-03-07 17:36:40 +08:00
Jonathan M Hsieh f658f3ef83 HBASE-15356 Remove unused imports (Youngjoon Kim) 2016-03-03 11:42:38 -08:00