HBASE-2692 Master rewrite and cleanup for 0.90 -- added in Jon's documentation ohow region transition works now

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@991435 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
Michael Stack 2010-09-01 04:59:39 +00:00
parent dff733cd3a
commit 85babe8fe7

View File

@ -7,14 +7,11 @@
xmlns:html="http://www.w3.org/1999/xhtml"
xmlns:db="http://docbook.org/ns/docbook">
<info>
<title>HBase Book
<?eval ${project.version}?>
</title>
<title>HBase Book <?eval ${project.version}?></title>
</info>
<chapter xml:id="getting_started">
<title >Getting Started</title>
<title>Getting Started</title>
<section>
<title>Requirements</title>
@ -64,4 +61,580 @@
<para></para>
</chapter>
<chapter>
<title>Regions</title>
<para>This chapter is all about Regions.</para>
<note>
<para>TODO: Review all of the below to ensure it matches what was
committed -- St.Ack 20100901</para>
</note>
<section>
<title>Region Transitions</title>
<para>Regions only transition in a limited set of circumstances.</para>
<section>
<title>Cluster Startup</title>
<para>During cluster startup, the Master will know that it is a
cluster startup and do a bulk assignment.</para>
<note>
<para>This should take HDFS block locations into account.</para>
</note>
<itemizedlist>
<listitem>
<para>Master startup determines whether this is startup or
failover by counting the number of RegionServer nodes in ZooKeeper.</para>
</listitem>
<listitem>
<para>Master waits for the minimum number of RegionServers to be
available to be assigned regions.</para>
</listitem>
<listitem>
<para>Master clears out anything in the <filename>/unassigned</filename> directory in ZooKeeper.</para>
</listitem>
<listitem>
<para>Master randomly assigns out <constant>-ROOT-</constant> and
then <constant>.META.</constant>.</para>
</listitem>
<listitem>
<para>Master determines a bulk assignment plan via the
<classname>LoadBalancer</classname></para>
</listitem>
<listitem>
<para>Master stores the plan in the
<classname>AssignmentManager</classname>.</para>
</listitem>
<listitem>
<para>Master creates <code>OFFLINE</code> ZooKeeper nodes in
<filename>/unassigned</filename> for every region.</para>
</listitem>
<listitem>
<para>Master sends RPCs to each RegionServer, telling them to
<code>OPEN</code> their regions.</para>
</listitem>
</itemizedlist>
<para>All special cluster startup logic ends here.</para>
<note>
<para>So what can go wrong?</para>
<itemizedlist>
<listitem>
<para>We assume that the Master will not fail until after the
<code>OFFLINE</code> nodes have been created in ZK. RegionServers can fail at
any time.</para>
</listitem>
<listitem>
<para>If an RS fails at some point during this process, normal
region open/opening/opened handling will take care of it.</para>
<para>If the RS successfully opened a region, then it will be
taken care of in the normal RS failure handling.</para>
<para>If the RS did not successfully open a region, the
RegionManager or MasterPlanner will notice that the OFFLINE (or
OPENING) node in ZK has not been updated. This will trigger a
re-assignment to a different server. This logic is not special
to startup, all assignments will eventually time out if the
destination server never proceeds.</para>
</listitem>
<listitem>
<para>If the Master fails (after creating the ZK nodes), the
failed-over Master will see all of the regions in transition. It
will handle it in the same way any failed-over Master will
handle existing regions in transition.</para>
</listitem>
</itemizedlist>
</note>
</section>
<section>
<title>Load Balancing</title>
<para> Periodically, and when there are not any regions in transition,
a load balancer will run and move regions around to balance cluster
load.</para>
<itemizedlist>
<listitem>
<para>Periodic timer expires initializing a load balance (Load
Balancer is an instance of <classname>Chore</classname>).</para>
</listitem>
<listitem>
<para>Currently if regions in transition, load balancer goes back
to sleep.</para>
<note>
<para>Should it block until there are no regions in
transition.</para>
</note>
</listitem>
<listitem>
<para> The <classname>AssignmentManager</classname> determines a
balancing plan via the LoadBalancer.</para>
</listitem>
<listitem>
<para> Master stores the plan in the
<classname>AssignmentMaster</classname> store of
<classname>RegionPlan</classname>s</para>
</listitem>
<listitem>
<para> Master sends RPCs to the source RSs, telling them to
<code>CLOSE</code> the regions.</para>
</listitem>
</itemizedlist>
<para>That is it for the initial part of the load balance. Further
steps will be executed following event-triggers from ZK or timeouts if
closes run too long. It's not clear what to do in the case of a
long-running CLOSE besides ask again.</para>
<itemizedlist>
<listitem>
<para> RS receives CLOSE RPC, changes to CLOSING, and begins
closing the region.</para>
</listitem>
<listitem>
<para>Master sees that region is now CLOSING but does
nothing.</para>
</listitem>
<listitem>
<para>RS closes region and changes ZK node to CLOSED.</para>
</listitem>
<listitem>
<para>Master sees that region is now CLOSED.</para>
</listitem>
<listitem>
<para>Master looks at the plan for the specified region to figure
out the desired destination server.</para>
</listitem>
<listitem>
<para>Master sends an RPC to the destination RS telling it to OPEN
the region.</para>
</listitem>
<listitem>
<para>RS receives OPEN RPC, changes to OPENING, and begins opening
the region.</para>
</listitem>
<listitem>
<para>Master sees that region is now OPENING but does
nothing.</para>
</listitem>
<listitem>
<para>RS opens region and changes ZK node to OPENED. Edits .META.
updating the regions location.</para>
</listitem>
<listitem>
<para>Master sees that region is now OPENED.</para>
</listitem>
<listitem>
<para>Master removes the region from all in-memory
structures.</para>
</listitem>
<listitem>
<para>Master deletes the OPENED node from ZK.</para>
</listitem>
</itemizedlist>
<para>The Master or RSs can fail during this process. There is nothing
special about handling regions in transition due to load balancing so
consult the descriptions below for how this is handled.</para>
</section>
<section>
<title>Table Enable/Disable</title>
<para> Users can enable and disable tables manually. This is done to
make config changes to tables, drop tables, etc...</para>
<note>
<para>Because all failover logic is designed to ensure assignment of
all regions in transition, these operations will not properly ride
over Master or RegionServer failures. Since these are
client-triggered operations, this should be okay for the initial
master design. Moving forward, a special node could be put in ZK to
denote that a enable/disable has been requested. Another option is
to persist region movement plans into ZK instead of just in-memory.
In that case, an empty destination would signal that the region
should not be reopened after being closed.</para>
</note>
<section>
<title>Disable</title>
<itemizedlist>
<listitem>
<para>Client sends Master an RPC to disable a table.</para>
</listitem>
<listitem>
<para>Master finds all regions of the table.</para>
</listitem>
<listitem>
<para>Master stores the plan (do not re-open the regions once
closed).</para>
</listitem>
<listitem>
<para>Master sends RPCs to RSs to close all the regions of the
table.</para>
</listitem>
<listitem>
<para>RS receives CLOSE RPC, creates ZK node in CLOSING state,
and begins closing the region.</para>
</listitem>
<listitem>
<para>Master sees that region is now CLOSING but does
nothing.</para>
</listitem>
<listitem>
<para>RS closes region and changes ZK node to CLOSED.</para>
</listitem>
<listitem>
<para>Master sees that region is now CLOSED.</para>
</listitem>
<listitem>
<para>Master looks at the plan for the specified region and sees
that it should not reopen.</para>
</listitem>
<listitem>
<para>Master deletes the unassigned znode. It is no longer
responsible for ensuring assignment/availability of this
region.</para>
</listitem>
</itemizedlist>
<section>
<title>Enable</title>
<itemizedlist>
<listitem>
<para>Client sends Master an RPC to disable a table.</para>
</listitem>
<listitem>
<para>Master finds all regions of the table.</para>
</listitem>
<listitem>
<para>Master creates an unassigned node in an OFFLINE state
for each region.</para>
</listitem>
<listitem>
<para>Master sends RPCs to RSs to open all the regions of the
table.</para>
</listitem>
<listitem>
<para>RS receives OPEN RPC, transitions ZK node to OPENING
state, and begins opening the region.</para>
</listitem>
<listitem>
<para>Master sees that region is now OPENING but does
nothing.</para>
</listitem>
<listitem>
<para>RS opens region and changes ZK node to OPENED.</para>
</listitem>
<listitem>
<para>Master sees that region is now OPENED.</para>
</listitem>
<listitem>
<para>Master deletes the unassigned znode.</para>
</listitem>
</itemizedlist>
</section>
</section>
</section>
<section>
<title>RegionServer Failure</title>
<itemizedlist>
<listitem>
<para>Master is alerted via ZK that an RS ephemeral node is
gone.</para>
</listitem>
<listitem>
<para>Master begins RS failure process.</para>
</listitem>
<listitem>
<para>Master determines which regions need to be handled.</para>
</listitem>
<listitem>
<para>Master in-memory state shows all regions currently assigned
to the dead RS.</para>
</listitem>
<listitem>
<para>Master in-memory plans show any regions that were in
transitioning to the dead RS.</para>
</listitem>
<listitem>
<para>With list of regions, Master now forces assignment of all
regions to other RSs.</para>
</listitem>
<listitem>
<para>Master creates or force updates all existing ZK unassigned
nodes to be OFFLINE.</para>
</listitem>
<listitem>
<para>Master sends RPCs to RSs to open all the regions.</para>
</listitem>
<listitem>
<para>Normal operations from here on.</para>
</listitem>
</itemizedlist>
<para>There are some complexities here. For regions in transition that
were somehow involved with the dead RS, these could be in any of the 5
states in ZK.</para>
<itemizedlist>
<listitem>
<para> <code>OFFLINE</code> Generate a new assignment and send an
OPEN RPC.</para>
</listitem>
<listitem>
<para> <code>CLOSING</code> If the failed RS is the source, we
overwrite the state to OFFLINE, generate a new assignment, and
send an OPEN RPC. If the failed RS is the destination, we
overwrite the state to OFFLINE and send an OPEN RPC to the
original destination. If for some reason we don't have an existing
plan (concurrent Master failure), generate a new assignment and
send an OPEN RPC.</para>
</listitem>
<listitem>
<para><code>CLOSED</code> If the failed RS is the source, we can
safely ignore this. The normal ZK event handling should deal with
this. If the failed RS is the destination, we generate a new
assignment and send an OPEN RPC.</para>
</listitem>
<listitem>
<para> OPENING or OPENED If the failed RS was the original source,
ignore. If the failed RS is the destination, we overwrite the
state to OFFLINE, generate a new assignment, and send an OPEN
RPC.</para>
</listitem>
</itemizedlist>
<para>In all of these cases, it is important to note that the
transitions on the RS side ensure only a single RS ever successfully
completes a transition. This is done by reading the current state,
verifying it is expected, and then issuing the update with the version
number of the read value. If multiple RSs are attempting this
operation, exactly one can succeed.</para>
</section>
<section>
<title>Master Failover</title>
<itemizedlist>
<listitem>
<para>Master initializes and finds out that he is a failed-over
Master.</para>
</listitem>
<listitem>
<para>Before Master starts up the normal handlers for region
transitions he grabs all nodes in /unassigned.</para>
</listitem>
<listitem>
<para>If no regions are in transition, failover is done and he
continues.</para>
</listitem>
<listitem>
<para>If regions are in transition, each will be handled according
to the current region state in ZK.</para>
</listitem>
<listitem>
<para> Before processing the regions in transition, the normal
handlers start to ensure we don't miss any transitions. The
handling of opens on the RS side ensures we don't dupe assign even
if things have changed before we finish acting on
them.<itemizedlist>
<listitem>
<para>OFFLINE Generate a new assignment and send an OPEN
RPC.</para>
</listitem>
<listitem>
<para>CLOSING Nothing to be done. Normal handlers take care
of timeouts.</para>
</listitem>
<listitem>
<para>CLOSED Generate a new assignment and send an OPEN
RPC.</para>
</listitem>
<listitem>
<para>OPENING Nothing to be done. Normal handlers take care
of timeouts.</para>
</listitem>
<listitem>
<para>OPENED Delete the node from ZK. Region was
successfully opened but the previous Master did not
acknowledge it.</para>
</listitem>
</itemizedlist></para>
</listitem>
<listitem>
<para>Once this is done, everything further is dealt with as
normal by the RegionManager.</para>
</listitem>
</itemizedlist>
</section>
<section>
<title>Summary of Region Transition States</title>
<note>
<para>Check below is complete -- St.Ack 20100901</para>
</note>
<section>
<title>Master</title>
<itemizedlist>
<listitem>
<para>Master creates an unassigned node as OFFLINE.</para>
<para>Cluster startup and table enabling.</para>
</listitem>
<listitem>
<para>Master forces an existing unassigned node to
OFFLINE.</para>
<para>RegionServer failure.</para>
<para>Allows transitions from all states to OFFLINE.</para>
</listitem>
<listitem>
<para>Master deletes an unassigned node that was in a OPENED
state.</para>
<para>Normal region transitions. Besides cluster startup, no
other deletions of unassigned nodes is allowed.</para>
</listitem>
<listitem>
<para>Master deletes all unassigned nodes regardless of
state.</para>
<para>Cluster startup before any assignment happens.</para>
</listitem>
</itemizedlist>
</section>
<section>
<title>RegionServer</title>
<itemizedlist>
<listitem>
<para> RegionServer creates an unassigned node as
CLOSING.</para>
<para>All region closes will do this in response to a CLOSE RPC
from Master. </para>
<para>A node can never be transitioned to CLOSING, only
created.</para>
</listitem>
<listitem>
<para>RegionServer transitions an unassigned node from CLOSING
to CLOSED.</para>
<para>Normal region closes. CAS operation.</para>
</listitem>
<listitem>
<para>RegionServer transitions an unassigned node from OFFLINE
to OPENING.</para>
<para>All region opens will do this in response to an OPEN RPC
from the Master.</para>
<para>Normal region opens. CAS operation.</para>
</listitem>
<listitem>
<para>RegionServer transitions an unassigned node from OPENING
to OPENED.</para>
<para>Normal region opens. CAS operation.</para>
</listitem>
</itemizedlist>
</section>
</section>
</section>
</chapter>
<appendix>
<title></title>
<para></para>
</appendix>
</book>