HBASE-11522 Move Replication information into the Ref Guide (Misty Stanley-Jones)
This commit is contained in:
parent
ff655e04d1
commit
fe54e7d7ae
|
@ -1107,19 +1107,491 @@ false
|
||||||
<section
|
<section
|
||||||
xml:id="cluster_replication">
|
xml:id="cluster_replication">
|
||||||
<title>Cluster Replication</title>
|
<title>Cluster Replication</title>
|
||||||
<para>See <link
|
<note>
|
||||||
|
<para>This information was previously available at <link
|
||||||
xlink:href="http://hbase.apache.org/replication.html">Cluster Replication</link>. </para>
|
xlink:href="http://hbase.apache.org/replication.html">Cluster Replication</link>. </para>
|
||||||
|
</note>
|
||||||
|
<para>HBase provides a replication mechanism to copy data between HBase
|
||||||
|
clusters. Replication can be used as a disaster recovery solution and as a mechanism for high
|
||||||
|
availability. You can also use replication to separate web-facing operations from back-end
|
||||||
|
jobs such as MapReduce.</para>
|
||||||
|
|
||||||
|
<para>In terms of architecture, HBase replication is master-push. This takes advantage of the
|
||||||
|
fact that each region server has its own write-ahead log (WAL). One master cluster can
|
||||||
|
replicate to any number of slave clusters, and each region server replicates its own stream of
|
||||||
|
edits. For more information on the different properties of master/slave replication and other
|
||||||
|
types of replication, see the article <link
|
||||||
|
xlink:href="http://highscalability.com/blog/2009/8/24/how-google-serves-data-from-multiple-datacenters.html">How
|
||||||
|
Google Serves Data From Multiple Datacenters</link>.</para>
|
||||||
|
|
||||||
|
<para>Replication is asynchronous, allowing clusters to be geographically distant or to have
|
||||||
|
some gaps in availability. This also means that data between master and slave clusters will
|
||||||
|
not be instantly consistent. Rows inserted on the master are not immediately available or
|
||||||
|
consistent with rows on the slave clusters. rows inserted on the master cluster won’t be
|
||||||
|
available at the same time on the slave clusters. The goal is eventual consistency. </para>
|
||||||
|
|
||||||
|
<para>The replication format used in this design is conceptually the same as the <firstterm><link
|
||||||
|
xlink:href="http://dev.mysql.com/doc/refman/5.1/en/replication-formats.html">statement-based
|
||||||
|
replication</link></firstterm> design used by MySQL. Instead of SQL statements, entire
|
||||||
|
WALEdits (consisting of multiple cell inserts coming from Put and Delete operations on the
|
||||||
|
clients) are replicated in order to maintain atomicity. </para>
|
||||||
|
|
||||||
|
<para>The WALs for each region server must be kept in HDFS as long as they are needed to
|
||||||
|
replicate data to any slave cluster. Each region server reads from the oldest log it needs to
|
||||||
|
replicate and keeps track of the current position inside ZooKeeper to simplify failure
|
||||||
|
recovery. That position, as well as the queue of WALs to process, may be different for every
|
||||||
|
slave cluster.</para>
|
||||||
|
|
||||||
|
<para>The clusters participating in replication can be of different sizes. The master
|
||||||
|
cluster relies on randomization to attempt to balance the stream of replication on the slave clusters</para>
|
||||||
|
|
||||||
|
<para>HBase supports master/master and cyclic replication as well as replication to multiple
|
||||||
|
slaves.</para>
|
||||||
|
|
||||||
|
<figure>
|
||||||
|
<title>Replication Architecture Overview</title>
|
||||||
|
<mediaobject>
|
||||||
|
<imageobject>
|
||||||
|
<imagedata fileref="replication_overview.png" />
|
||||||
|
</imageobject>
|
||||||
|
<textobject>
|
||||||
|
<para>Illustration of the replication architecture in HBase, as described in the prior
|
||||||
|
text.</para>
|
||||||
|
</textobject>
|
||||||
|
</mediaobject>
|
||||||
|
</figure>
|
||||||
|
|
||||||
|
<formalpara>
|
||||||
|
<title>Enabling and Configuring Replication</title>
|
||||||
|
<para>See the <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/replication/package-summary.html#requirements">
|
||||||
|
API documentation for replication</link> for information on enabling and configuring
|
||||||
|
replication.</para>
|
||||||
|
</formalpara>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>Life of a WAL Edit</title>
|
||||||
|
<para>A single WAL edit goes through several steps in order to be replicated to a slave
|
||||||
|
cluster.</para>
|
||||||
|
|
||||||
|
<orderedlist>
|
||||||
|
<title>When the slave responds correctly:</title>
|
||||||
|
<listitem>
|
||||||
|
<para>A HBase client uses a Put or Delete operation to manipulate data in HBase.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>The region server writes the request to the WAL in a way that would allow it to be
|
||||||
|
replayed if it were not written successfully.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>If the changed cell corresponds to a column family that is scoped for replication,
|
||||||
|
the edit is added to the queue for replication.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>In a separate thread, the edit is read from the log, as part of a batch process.
|
||||||
|
Only the KeyValues that are eligible for replication are kept. Replicable KeyValues are
|
||||||
|
part of a column family whose schema is scoped GLOBAL, are not part of a catalog such as
|
||||||
|
<code>hbase:meta</code>, and did not originate from the target slave cluster, in the
|
||||||
|
case of cyclic replication.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>The edit is tagged with the master's UUID and added to a buffer. When the buffer is
|
||||||
|
filled, or the reader reaches the end of the file, the buffer is sent to a random region
|
||||||
|
server on the slave cluster.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>The region server reads the edits sequentially and separates them into buffers, one
|
||||||
|
buffer per table. After all edits are read, each buffer is flushed using <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html"
|
||||||
|
>HTable</link>, HBase's normal client. The master's UUID is preserved in the edits
|
||||||
|
they are applied, in order to allow for cyclic replication.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>In the master, the offset for the WAL that is currently being replicated is
|
||||||
|
registered in ZooKeeper.</para>
|
||||||
|
</listitem>
|
||||||
|
</orderedlist>
|
||||||
|
<orderedlist>
|
||||||
|
<title>When the slave does not respond:</title>
|
||||||
|
<listitem>
|
||||||
|
<para>The first three steps, where the edit is inserted, are identical.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>Again in a separate thread, the region server reads, filters, and edits the log
|
||||||
|
edits in the same way as above. The slave region server does not answer the RPC
|
||||||
|
call.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>The master sleeps and tries again a configurable number of times.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>If the slave region server is still not available, the master selects a new subset
|
||||||
|
of region server to replicate to, and tries again to send the buffer of edits.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>Meanwhile, the WALs are rolled and stored in a queue in ZooKeeper. Logs that are
|
||||||
|
<firstterm>archived</firstterm> by their region server, by moving them from the region
|
||||||
|
server's log directory to a central log directory, will update their paths in the
|
||||||
|
in-memory queue of the replicating thread.</para>
|
||||||
|
</listitem>
|
||||||
|
<listitem>
|
||||||
|
<para>When the slave cluster is finally available, the buffer is applied in the same way
|
||||||
|
as during normal processing. The master region server will then replicate the backlog of
|
||||||
|
logs that accumulated during the outage.</para>
|
||||||
|
</listitem>
|
||||||
|
</orderedlist>
|
||||||
|
|
||||||
|
|
||||||
<note xml:id="cluster.replication.preserving.tags">
|
<note xml:id="cluster.replication.preserving.tags">
|
||||||
<title>Preserving Tags During Replication</title>
|
<title>Preserving Tags During Replication</title>
|
||||||
<para>By default, the codec used for replication between clusters strips tags, such as
|
<para>By default, the codec used for replication between clusters strips tags, such as
|
||||||
cell-level ACLs, from cells. To prevent the tags from being stripped, you can use a
|
cell-level ACLs, from cells. To prevent the tags from being stripped, you can use a
|
||||||
different codec which does not strip them. Configure
|
different codec which does not strip them. Configure
|
||||||
<code>hbase.replication.rpc.codec</code> to use
|
<code>hbase.replication.rpc.codec</code> to use
|
||||||
<literal>org.apache.hadoop.hbase.codec.KeyValueCodecWithTags</literal>, on both the source
|
<literal>org.apache.hadoop.hbase.codec.KeyValueCodecWithTags</literal>, on both the
|
||||||
and sink RegionServers involved in the replication. This option was introduced in <link
|
source and sink RegionServers involved in the replication. This option was introduced in
|
||||||
xlink:href="https://issues.apache.org/jira/browse/HBASE-10322">HBASE-10322</link>.</para>
|
<link xlink:href="https://issues.apache.org/jira/browse/HBASE-10322"
|
||||||
|
>HBASE-10322</link>.</para>
|
||||||
</note>
|
</note>
|
||||||
</section>
|
</section>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>Replication Internals</title>
|
||||||
|
<variablelist>
|
||||||
|
<varlistentry>
|
||||||
|
<term>Replication State in ZooKeeper</term>
|
||||||
|
<listitem>
|
||||||
|
<para>HBase replication maintains its state in ZooKeeper. By default, the state is
|
||||||
|
contained in the base node <filename>/hbase/replication</filename>. This node contains
|
||||||
|
two child nodes, the <code>Peers</code> znode and the <code>RS</code> znode.</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
<varlistentry>
|
||||||
|
<term>The <code>Peers</code> Znode</term>
|
||||||
|
<listitem>
|
||||||
|
<para>The <code>peers</code> znode is stored in
|
||||||
|
<filename>/hbase/replication/peers</filename> by default. It consists of a list of
|
||||||
|
all peer replication clusters, along with the status of each of them. The value of
|
||||||
|
each peer is its cluster key, which is provided in the HBase Shell. The cluster key
|
||||||
|
contains a list of ZooKeeper nodes in the cluster's quorum, the client port for the
|
||||||
|
ZooKeeper quorum, and the base znode for HBase in HDFS on that cluster.</para>
|
||||||
|
<screen>
|
||||||
|
/hbase/replication/peers
|
||||||
|
/1 [Value: zk1.host.com,zk2.host.com,zk3.host.com:2181:/hbase]
|
||||||
|
/2 [Value: zk5.host.com,zk6.host.com,zk7.host.com:2181:/hbase]
|
||||||
|
</screen>
|
||||||
|
<para>Each peer has a child znode which indicates whether or not replication is enabled
|
||||||
|
on that cluster. These peer-state znodes do not contain any child znodes, but only
|
||||||
|
contain a Boolean value. This value is read and maintained by the
|
||||||
|
R<code>eplicationPeer.PeerStateTracker</code> class.</para>
|
||||||
|
<screen>
|
||||||
|
/hbase/replication/peers
|
||||||
|
/1/peer-state [Value: ENABLED]
|
||||||
|
/2/peer-state [Value: DISABLED]
|
||||||
|
</screen>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
<varlistentry>
|
||||||
|
<term>The <code>RS</code> Znode</term>
|
||||||
|
<listitem>
|
||||||
|
<para>The <code>rs</code> znode contains a list of WAL logs which need to be replicated.
|
||||||
|
This list is divided into a set of queues organized by region server and the peer
|
||||||
|
cluster the region server is shipping the logs to. The rs znode has one child znode
|
||||||
|
for each region server in the cluster. The child znode name is the region server's
|
||||||
|
hostname, client port, and start code. This list includes both live and dead region
|
||||||
|
servers.</para>
|
||||||
|
<screen>
|
||||||
|
/hbase/replication/rs
|
||||||
|
/hostname.example.org,6020,1234
|
||||||
|
/hostname2.example.org,6020,2856
|
||||||
|
</screen>
|
||||||
|
<para>Each <code>rs</code> znode contains a list of WAL replication queues, one queue
|
||||||
|
for each peer cluster it replicates to. These queues are represented by child znodes
|
||||||
|
named by the cluster ID of the peer cluster they represent.</para>
|
||||||
|
<screen>
|
||||||
|
/hbase/replication/rs
|
||||||
|
/hostname.example.org,6020,1234
|
||||||
|
/1
|
||||||
|
/2
|
||||||
|
</screen>
|
||||||
|
<para>Each queue has one child znode for each WAL log that still needs to be replicated.
|
||||||
|
the value of these child znodes is the last position that was replicated. This
|
||||||
|
position is updated each time a WAL log is replicated.</para>
|
||||||
|
<screen>
|
||||||
|
/hbase/replication/rs
|
||||||
|
/hostname.example.org,6020,1234
|
||||||
|
/1
|
||||||
|
23522342.23422 [VALUE: 254]
|
||||||
|
12340993.22342 [VALUE: 0]
|
||||||
|
</screen>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
</variablelist>
|
||||||
|
</section>
|
||||||
|
<section>
|
||||||
|
<title>Replication Configuration Options</title>
|
||||||
|
<informaltable>
|
||||||
|
<tgroup cols="3">
|
||||||
|
<thead>
|
||||||
|
<row>
|
||||||
|
<entry>Option</entry>
|
||||||
|
<entry>Description</entry>
|
||||||
|
<entry>Default</entry>
|
||||||
|
</row>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
<row>
|
||||||
|
<entry><para><code>zookeeper.znode.parent</code></para></entry>
|
||||||
|
<entry><para>The name of the base ZooKeeper znode used for HBase</para></entry>
|
||||||
|
<entry><para><literal>/hbase</literal></para></entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry><para><code>zookeeper.znode.replication</code></para></entry>
|
||||||
|
<entry><para>The name of the base znode used for replication</para></entry>
|
||||||
|
<entry><para><literal>replication</literal></para></entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry><para><code>zookeeper.znode.replication.peers</code></para></entry>
|
||||||
|
<entry><para>The name of the <code>peer</code> znode</para></entry>
|
||||||
|
<entry><para><literal>peers</literal></para></entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry><para><code>zookeeper.znode.replication.peers.state</code></para></entry>
|
||||||
|
<entry><para>The name of <code>peer-state</code> znode</para></entry>
|
||||||
|
<entry><para><literal>peer-state</literal></para></entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry><para><code>zookeeper.znode.replication.rs</code></para></entry>
|
||||||
|
<entry><para>The name of the <code>rs</code> znode</para></entry>
|
||||||
|
<entry><para><literal>rs</literal></para></entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry><para><code>hbase.replication</code></para></entry>
|
||||||
|
<entry><para>Whether replication is enabled or disabled on a given cluster</para></entry>
|
||||||
|
<entry><para><literal>false</literal></para></entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry><para><code>eplication.sleep.before.failover</code></para></entry>
|
||||||
|
<entry><para>How many milliseconds a worker should sleep before attempting to replicate
|
||||||
|
a dead region server's WAL queues.</para></entry>
|
||||||
|
<entry><para><literal></literal></para></entry>
|
||||||
|
</row>
|
||||||
|
<row>
|
||||||
|
<entry><para><code>replication.executor.workers</code></para></entry>
|
||||||
|
<entry><para>The number of region servers a given region server should attempt to
|
||||||
|
failover simultaneously.</para></entry>
|
||||||
|
<entry><para><literal>1</literal></para></entry>
|
||||||
|
</row>
|
||||||
|
</tbody>
|
||||||
|
</tgroup>
|
||||||
|
</informaltable>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>Replication Implementation Details</title>
|
||||||
|
<formalpara>
|
||||||
|
<title>Choosing Region Servers to Replicate To</title>
|
||||||
|
<para>When a master cluster region server initiates a replication source to a slave cluster,
|
||||||
|
it first connects to the slave's ZooKeeper ensemble using the provided cluster key . It
|
||||||
|
then scans the <filename>rs/</filename> directory to discover all the available sinks
|
||||||
|
(region servers that are accepting incoming streams of edits to replicate) and randomly
|
||||||
|
chooses a subset of them using a configured ratio which has a default value of 10%. For
|
||||||
|
example, if a slave cluster has 150 machines, 15 will be chosen as potential recipient for
|
||||||
|
edits that this master cluster region server sends. Because this selection is performed by
|
||||||
|
each master region server, the probability that all slave region servers are used is very
|
||||||
|
high, and this method works for clusters of any size. For example, a master cluster of 10
|
||||||
|
machines replicating to a slave cluster of 5 machines with a ratio of 10% causes the
|
||||||
|
master cluster region servers to choose one machine each at random.</para>
|
||||||
|
</formalpara>
|
||||||
|
<para>A ZooKeeper watcher is placed on the
|
||||||
|
<filename>${<replaceable>zookeeper.znode.parent</replaceable>}/rs</filename> node of the
|
||||||
|
slave cluster by each of the master cluster's region servers. This watch is used to monitor
|
||||||
|
changes in the composition of the slave cluster. When nodes are removed from the slave
|
||||||
|
cluster, or if nodes go down or come back up, the master cluster's region servers will
|
||||||
|
respond by selecting a new pool of slave region servers to replicate to.</para>
|
||||||
|
|
||||||
|
<formalpara>
|
||||||
|
<title>Keeping Track of Logs</title>
|
||||||
|
|
||||||
|
<para>Each master cluster region server has its own znode in the replication znodes
|
||||||
|
hierarchy. It contains one znode per peer cluster (if 5 slave clusters, 5 znodes are
|
||||||
|
created), and each of these contain a queue of WALs to process. Each of these queues will
|
||||||
|
track the WALs created by that region server, but they can differ in size. For example, if
|
||||||
|
one slave cluster becomes unavailable for some time, the WALs should not be deleted, so
|
||||||
|
they need to stay in the queue while the others are processed. See <xref
|
||||||
|
linkend="rs.failover.details"/> for an example.</para>
|
||||||
|
</formalpara>
|
||||||
|
<para>When a source is instantiated, it contains the current WAL that the region server is
|
||||||
|
writing to. During log rolling, the new file is added to the queue of each slave cluster's
|
||||||
|
znode just before it is made available. This ensures that all the sources are aware that a
|
||||||
|
new log exists before the region server is able to append edits into it, but this operations
|
||||||
|
is now more expensive. The queue items are discarded when the replication thread cannot read
|
||||||
|
more entries from a file (because it reached the end of the last block) and there are other
|
||||||
|
files in the queue. This means that if a source is up to date and replicates from the log
|
||||||
|
that the region server writes to, reading up to the "end" of the current file will not
|
||||||
|
delete the item in the queue.</para>
|
||||||
|
<para>A log can be archived if it is no longer used or if the number of logs exceeds
|
||||||
|
<code>hbase.regionserver.maxlogs</code> because the insertion rate is faster than regions
|
||||||
|
are flushed. When a log is archived, the source threads are notified that the path for that
|
||||||
|
log changed. If a particular source has already finished with an archived log, it will just
|
||||||
|
ignore the message. If the log is in the queue, the path will be updated in memory. If the
|
||||||
|
log is currently being replicated, the change will be done atomically so that the reader
|
||||||
|
doesn't attempt to open the file when has already been moved. Because moving a file is a
|
||||||
|
NameNode operation , if the reader is currently reading the log, it won't generate any
|
||||||
|
exception.</para>
|
||||||
|
<formalpara>
|
||||||
|
<title>Reading, Filtering and Sending Edits</title>
|
||||||
|
<para>By default, a source attempts to read from a WAL and ship log entries to a sink as
|
||||||
|
quickly as possible. Speed is limited by the filtering of log entries Only KeyValues that
|
||||||
|
are scoped GLOBAL and that do not belong to catalog tables will be retained. Speed is also
|
||||||
|
limited by total size of the list of edits to replicate per slave, which is limited to 64
|
||||||
|
MB by default. With this configuration, a master cluster region server with three slaves
|
||||||
|
would use at most 192 MB to store data to replicate. This does not account for the data
|
||||||
|
which was filtered but not garbage collected.</para>
|
||||||
|
</formalpara>
|
||||||
|
<para>Once the maximum size of edits has been buffered or the reader reaces the end of the
|
||||||
|
WAL, the source thread stops reading and chooses at random a sink to replicate to (from the
|
||||||
|
list that was generated by keeping only a subset of slave region servers). It directly
|
||||||
|
issues a RPC to the chosen region server and waits for the method to return. If the RPC was
|
||||||
|
successful, the source determines whether the current file has been emptied or it contains
|
||||||
|
more data which needs to be read. If the file has been emptied, the source deletes the znode
|
||||||
|
in the queue. Otherwise, it registers the new offset in the log's znode. If the RPC threw an
|
||||||
|
exception, the source will retry 10 times before trying to find a different sink.</para>
|
||||||
|
<formalpara>
|
||||||
|
<title>Cleaning Logs</title>
|
||||||
|
<para>If replication is not enabled, the master's log-cleaning thread deletes old logs using
|
||||||
|
a configured TTL. This TTL-based method does not work well with replication, because
|
||||||
|
archived logs which have exceeded their TTL may still be in a queue. The default behavior
|
||||||
|
is augmented so that if a log is past its TTL, the cleaning thread looks up every queue
|
||||||
|
until it finds the log, while caching queues it has found. If the log is not found in any
|
||||||
|
queues, the log will be deleted. The next time the cleaning process needs to look for a
|
||||||
|
log, it starts by using its cached list.</para>
|
||||||
|
</formalpara>
|
||||||
|
<formalpara xml:id="rs.failover.details">
|
||||||
|
<title>Region Server Failover</title>
|
||||||
|
<para>When no region servers are failing, keeping track of the logs in ZooKeeper adds no
|
||||||
|
value. Unfortunately, region servers do fail, and since ZooKeeper is highly available, it
|
||||||
|
is useful for managing the transfer of the queues in the event of a failure.</para>
|
||||||
|
</formalpara>
|
||||||
|
<para>Each of the master cluster region servers keeps a watcher on every other region server,
|
||||||
|
in order to be notified when one dies (just as the master does). When a failure happens,
|
||||||
|
they all race to create a znode called <literal>lock</literal> inside the dead region
|
||||||
|
server's znode that contains its queues. The region server that creates it successfully then
|
||||||
|
transfers all the queues to its own znode, one at a time since ZooKeeper does not support
|
||||||
|
renaming queues. After queues are all transferred, they are deleted from the old location.
|
||||||
|
The znodes that were recovered are renamed with the ID of the slave cluster appended with
|
||||||
|
the name of the dead server.</para>
|
||||||
|
<para>Next, the master cluster region server creates one new source thread per copied queue,
|
||||||
|
and each of the source threads follows the read/filter/ship pattern. The main difference is
|
||||||
|
that those queues will never receive new data, since they do not belong to their new region
|
||||||
|
server. When the reader hits the end of the last log, the queue's znode is deleted and the
|
||||||
|
master cluster region server closes that replication source.</para>
|
||||||
|
<para>Given a master cluster with 3 region servers replicating to a single slave with id
|
||||||
|
<literal>2</literal>, the following hierarchy represents what the znodes layout could be
|
||||||
|
at some point in time. The region servers' znodes all contain a <literal>peers</literal>
|
||||||
|
znode which contains a single queue. The znode names in the queues represent the actual file
|
||||||
|
names on HDFS in the form
|
||||||
|
<literal><replaceable>address</replaceable>,<replaceable>port</replaceable>.<replaceable>timestamp</replaceable></literal>.</para>
|
||||||
|
<screen>
|
||||||
|
/hbase/replication/rs/
|
||||||
|
1.1.1.1,60020,123456780/
|
||||||
|
2/
|
||||||
|
1.1.1.1,60020.1234 (Contains a position)
|
||||||
|
1.1.1.1,60020.1265
|
||||||
|
1.1.1.2,60020,123456790/
|
||||||
|
2/
|
||||||
|
1.1.1.2,60020.1214 (Contains a position)
|
||||||
|
1.1.1.2,60020.1248
|
||||||
|
1.1.1.2,60020.1312
|
||||||
|
1.1.1.3,60020, 123456630/
|
||||||
|
2/
|
||||||
|
1.1.1.3,60020.1280 (Contains a position)
|
||||||
|
</screen>
|
||||||
|
<para>Assume that 1.1.1.2 loses its ZooKeeper session. The survivors will race to create a
|
||||||
|
lock, and, arbitrarily, 1.1.1.3 wins. It will then start transferring all the queues to its
|
||||||
|
local peers znode by appending the name of the dead server. Right before 1.1.1.3 is able to
|
||||||
|
clean up the old znodes, the layout will look like the following:</para>
|
||||||
|
<screen>
|
||||||
|
/hbase/replication/rs/
|
||||||
|
1.1.1.1,60020,123456780/
|
||||||
|
2/
|
||||||
|
1.1.1.1,60020.1234 (Contains a position)
|
||||||
|
1.1.1.1,60020.1265
|
||||||
|
1.1.1.2,60020,123456790/
|
||||||
|
lock
|
||||||
|
2/
|
||||||
|
1.1.1.2,60020.1214 (Contains a position)
|
||||||
|
1.1.1.2,60020.1248
|
||||||
|
1.1.1.2,60020.1312
|
||||||
|
1.1.1.3,60020,123456630/
|
||||||
|
2/
|
||||||
|
1.1.1.3,60020.1280 (Contains a position)
|
||||||
|
|
||||||
|
2-1.1.1.2,60020,123456790/
|
||||||
|
1.1.1.2,60020.1214 (Contains a position)
|
||||||
|
1.1.1.2,60020.1248
|
||||||
|
1.1.1.2,60020.1312
|
||||||
|
</screen>
|
||||||
|
<para>Some time later, but before 1.1.1.3 is able to finish replicating the last WAL from
|
||||||
|
1.1.1.2, it dies too. Some new logs were also created in the normal queues. The last region
|
||||||
|
server will then try to lock 1.1.1.3's znode and will begin transferring all the queues. The
|
||||||
|
new layout will be:</para>
|
||||||
|
<screen>
|
||||||
|
/hbase/replication/rs/
|
||||||
|
1.1.1.1,60020,123456780/
|
||||||
|
2/
|
||||||
|
1.1.1.1,60020.1378 (Contains a position)
|
||||||
|
|
||||||
|
2-1.1.1.3,60020,123456630/
|
||||||
|
1.1.1.3,60020.1325 (Contains a position)
|
||||||
|
1.1.1.3,60020.1401
|
||||||
|
|
||||||
|
2-1.1.1.2,60020,123456790-1.1.1.3,60020,123456630/
|
||||||
|
1.1.1.2,60020.1312 (Contains a position)
|
||||||
|
1.1.1.3,60020,123456630/
|
||||||
|
lock
|
||||||
|
2/
|
||||||
|
1.1.1.3,60020.1325 (Contains a position)
|
||||||
|
1.1.1.3,60020.1401
|
||||||
|
|
||||||
|
2-1.1.1.2,60020,123456790/
|
||||||
|
1.1.1.2,60020.1312 (Contains a position)
|
||||||
|
</screen>
|
||||||
|
<formalpara>
|
||||||
|
<title>Replication Metrics</title>
|
||||||
|
<para>The following metrics are exposed at the global region server level and (since HBase
|
||||||
|
0.95) at the peer level:</para>
|
||||||
|
</formalpara>
|
||||||
|
<variablelist>
|
||||||
|
<varlistentry>
|
||||||
|
<term><code>source.sizeOfLogQueue</code></term>
|
||||||
|
<listitem>
|
||||||
|
<para>number of WALs to process (excludes the one which is being processed) at the
|
||||||
|
Replication source</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
<varlistentry>
|
||||||
|
<term><code>source.shippedOps</code></term>
|
||||||
|
<listitem>
|
||||||
|
<para>number of mutations shipped</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
<varlistentry>
|
||||||
|
<term><code>source.logEditsRead</code></term>
|
||||||
|
<listitem>
|
||||||
|
<para>number of mutations read from HLogs at the replication source</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
<varlistentry>
|
||||||
|
<term><code>source.ageOfLastShippedOp</code></term>
|
||||||
|
<listitem>
|
||||||
|
<para>age of last batch that was shipped by the replication source</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
</variablelist>
|
||||||
|
|
||||||
|
</section>
|
||||||
|
</section>
|
||||||
<section
|
<section
|
||||||
xml:id="ops.backup">
|
xml:id="ops.backup">
|
||||||
<title>HBase Backup</title>
|
<title>HBase Backup</title>
|
||||||
|
|
|
@ -26,520 +26,6 @@
|
||||||
</title>
|
</title>
|
||||||
</properties>
|
</properties>
|
||||||
<body>
|
<body>
|
||||||
<section name="Overview">
|
<p>This information has been moved to <a href="http://hbase.apache.org/book.html#cluster_replication">the Cluster Replication</a> section of the <a href="http://hbase.apache.org/book.html">Apache HBase Reference Guide</a>.</p>
|
||||||
<p>
|
|
||||||
The replication feature of Apache HBase (TM) provides a way to copy data between HBase deployments. It
|
|
||||||
can serve as a disaster recovery solution and can contribute to provide
|
|
||||||
higher availability at the HBase layer. It can also serve more practically;
|
|
||||||
for example, as a way to easily copy edits from a web-facing cluster to a "MapReduce"
|
|
||||||
cluster which will process old and new data and ship back the results
|
|
||||||
automatically.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
The basic architecture pattern used for Apache HBase replication is (HBase cluster) master-push;
|
|
||||||
it is much easier to keep track of what’s currently being replicated since
|
|
||||||
each region server has its own write-ahead-log (aka WAL or HLog), just like
|
|
||||||
other well known solutions like MySQL master/slave replication where
|
|
||||||
there’s only one bin log to keep track of. One master cluster can
|
|
||||||
replicate to any number of slave clusters, and each region server will
|
|
||||||
participate to replicate their own stream of edits. For more information
|
|
||||||
on the different properties of master/slave replication and other types
|
|
||||||
of replication, please consult <a href="http://highscalability.com/blog/2009/8/24/how-google-serves-data-from-multiple-datacenters.html">
|
|
||||||
How Google Serves Data From Multiple Datacenters</a>.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
The replication is done asynchronously, meaning that the clusters can
|
|
||||||
be geographically distant, the links between them can be offline for
|
|
||||||
some time, and rows inserted on the master cluster won’t be
|
|
||||||
available at the same time on the slave clusters (eventual consistency).
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
The replication format used in this design is conceptually the same as
|
|
||||||
<a href="http://dev.mysql.com/doc/refman/5.1/en/replication-formats.html">
|
|
||||||
MySQL’s statement-based replication </a>. Instead of SQL statements, whole
|
|
||||||
WALEdits (consisting of multiple cell inserts coming from the clients'
|
|
||||||
Put and Delete) are replicated in order to maintain atomicity.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
The HLogs from each region server are the basis of HBase replication,
|
|
||||||
and must be kept in HDFS as long as they are needed to replicate data
|
|
||||||
to any slave cluster. Each RS reads from the oldest log it needs to
|
|
||||||
replicate and keeps the current position inside ZooKeeper to simplify
|
|
||||||
failure recovery. That position can be different for every slave
|
|
||||||
cluster, same for the queue of HLogs to process.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
The clusters participating in replication can be of asymmetric sizes
|
|
||||||
and the master cluster will do its “best effort” to balance the stream
|
|
||||||
of replication on the slave clusters by relying on randomization.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
As of version 0.92, Apache HBase supports master/master and cyclic
|
|
||||||
replication as well as replication to multiple slaves.
|
|
||||||
</p>
|
|
||||||
<img src="images/replication_overview.png"/>
|
|
||||||
</section>
|
|
||||||
<section name="Enabling replication">
|
|
||||||
<p>
|
|
||||||
The guide on enabling and using cluster replication is contained
|
|
||||||
in the API documentation shipped with your Apache HBase distribution.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
The most up-to-date documentation is
|
|
||||||
<a href="apidocs/org/apache/hadoop/hbase/replication/package-summary.html#requirements">
|
|
||||||
available at this address</a>.
|
|
||||||
</p>
|
|
||||||
</section>
|
|
||||||
<section name="Life of a log edit">
|
|
||||||
<p>
|
|
||||||
The following sections describe the life of a single edit going from a
|
|
||||||
client that communicates with a master cluster all the way to a single
|
|
||||||
slave cluster.
|
|
||||||
</p>
|
|
||||||
<section name="Normal processing">
|
|
||||||
<p>
|
|
||||||
The client uses an API that sends a Put, Delete or ICV to a region
|
|
||||||
server. The key values are transformed into a WALEdit by the region
|
|
||||||
server and is inspected by the replication code that, for each family
|
|
||||||
that is scoped for replication, adds the scope to the edit. The edit
|
|
||||||
is appended to the current WAL and is then applied to its MemStore.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
In a separate thread, the edit is read from the log (as part of a batch)
|
|
||||||
and only the KVs that are replicable are kept (that is, that they are part
|
|
||||||
of a family scoped GLOBAL in the family's schema, non-catalog so not
|
|
||||||
hbase:meta or -ROOT-, and did not originate in the target slave cluster - in
|
|
||||||
case of cyclic replication).
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
The edit is then tagged with the master's cluster UUID.
|
|
||||||
When the buffer is filled, or the reader hits the end of the file,
|
|
||||||
the buffer is sent to a random region server on the slave cluster.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
Synchronously, the region server that receives the edits reads them
|
|
||||||
sequentially and separates each of them into buffers, one per table.
|
|
||||||
Once all edits are read, each buffer is flushed using HTable, the normal
|
|
||||||
HBase client.The master's cluster UUID is retained in the edits applied at
|
|
||||||
the slave cluster in order to allow cyclic replication.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
Back in the master cluster's region server, the offset for the current
|
|
||||||
WAL that's being replicated is registered in ZooKeeper.
|
|
||||||
</p>
|
|
||||||
</section>
|
|
||||||
<section name="Non-responding slave clusters">
|
|
||||||
<p>
|
|
||||||
The edit is inserted in the same way.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
In the separate thread, the region server reads, filters and buffers
|
|
||||||
the log edits the same way as during normal processing. The slave
|
|
||||||
region server that's contacted doesn't answer to the RPC, so the master
|
|
||||||
region server will sleep and retry up to a configured number of times.
|
|
||||||
If the slave RS still isn't available, the master cluster RS will select a
|
|
||||||
new subset of RS to replicate to and will retry sending the buffer of
|
|
||||||
edits.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
In the mean time, the WALs will be rolled and stored in a queue in
|
|
||||||
ZooKeeper. Logs that are archived by their region server (archiving is
|
|
||||||
basically moving a log from the region server's logs directory to a
|
|
||||||
central logs archive directory) will update their paths in the in-memory
|
|
||||||
queue of the replicating thread.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
When the slave cluster is finally available, the buffer will be applied
|
|
||||||
the same way as during normal processing. The master cluster RS will then
|
|
||||||
replicate the backlog of logs.
|
|
||||||
</p>
|
|
||||||
</section>
|
|
||||||
</section>
|
|
||||||
<section name="Internals">
|
|
||||||
<p>
|
|
||||||
This section describes in depth how each of replication's internal
|
|
||||||
features operate.
|
|
||||||
</p>
|
|
||||||
<section name="Replication Zookeeper State">
|
|
||||||
<p>
|
|
||||||
HBase replication maintains all of its state in Zookeeper. By default, this state is
|
|
||||||
contained in the base znode:
|
|
||||||
</p>
|
|
||||||
<pre>
|
|
||||||
/hbase/replication
|
|
||||||
</pre>
|
|
||||||
<p>
|
|
||||||
There are two major child znodes in the base replication znode:
|
|
||||||
<ul>
|
|
||||||
<li><b>Peers znode:</b> /hbase/replication/peers</li>
|
|
||||||
<li><b>RS znode:</b> /hbase/replication/rs</li>
|
|
||||||
</ul>
|
|
||||||
</p>
|
|
||||||
<section name="The Peers znode">
|
|
||||||
<p>
|
|
||||||
The <b>peers znode</b> contains a list of all peer replication clusters and the
|
|
||||||
current replication state of those clusters. It has one child <i>peer znode</i>
|
|
||||||
for each peer cluster. The <i>peer znode</i> is named with the cluster id provided
|
|
||||||
by the user in the HBase shell. The value of the <i>peer znode</i> contains
|
|
||||||
the peers cluster key provided by the user in the HBase Shell. The cluster key
|
|
||||||
contains a list of zookeeper nodes in the clusters quorum, the client port for the
|
|
||||||
zookeeper quorum, and the base znode for HBase
|
|
||||||
(i.e. “zk1.host.com,zk2.host.com,zk3.host.com:2181:/hbase”).
|
|
||||||
</p>
|
|
||||||
<pre>
|
|
||||||
/hbase/replication/peers
|
|
||||||
/1 [Value: zk1.host.com,zk2.host.com,zk3.host.com:2181:/hbase]
|
|
||||||
/2 [Value: zk5.host.com,zk6.host.com,zk7.host.com:2181:/hbase]
|
|
||||||
</pre>
|
|
||||||
<p>
|
|
||||||
Each of these <i>peer znodes</i> has a child znode that indicates whether or not
|
|
||||||
replication is enabled on that peer cluster. These <i>peer-state znodes</i> do not
|
|
||||||
have child znodes and simply contain a boolean value (i.e. ENABLED or DISABLED).
|
|
||||||
This value is read/maintained by the <i>ReplicationPeer.PeerStateTracker</i> class.
|
|
||||||
</p>
|
|
||||||
<pre>
|
|
||||||
/hbase/replication/peers
|
|
||||||
/1/peer-state [Value: ENABLED]
|
|
||||||
/2/peer-state [Value: DISABLED]
|
|
||||||
</pre>
|
|
||||||
</section>
|
|
||||||
<section name="The RS znode">
|
|
||||||
<p>
|
|
||||||
The <b>rs znode</b> contains a list of all outstanding HLog files in the cluster
|
|
||||||
that need to be replicated. The list is divided into a set of queues organized by
|
|
||||||
region server and the peer cluster the region server is shipping the HLogs to. The
|
|
||||||
<b>rs znode</b> has one child znode for each region server in the cluster. The child
|
|
||||||
znode name is simply the regionserver name (a concatenation of the region server’s
|
|
||||||
hostname, client port and start code). These region servers could either be dead or alive.
|
|
||||||
</p>
|
|
||||||
<pre>
|
|
||||||
/hbase/replication/rs
|
|
||||||
/hostname.example.org,6020,1234
|
|
||||||
/hostname2.example.org,6020,2856
|
|
||||||
</pre>
|
|
||||||
<p>
|
|
||||||
Within each region server znode, the region server maintains a set of HLog replication
|
|
||||||
queues. Each region server has one queue for every peer cluster it replicates to.
|
|
||||||
These queues are represented by child znodes named using the cluster id of the peer
|
|
||||||
cluster they represent (see the peer znode section).
|
|
||||||
</p>
|
|
||||||
<pre>
|
|
||||||
/hbase/replication/rs
|
|
||||||
/hostname.example.org,6020,1234
|
|
||||||
/1
|
|
||||||
/2
|
|
||||||
</pre>
|
|
||||||
<p>
|
|
||||||
Each queue has one child znode for every HLog that still needs to be replicated.
|
|
||||||
The value of these HLog child znodes is the latest position that has been replicated.
|
|
||||||
This position is updated every time a HLog entry is replicated.
|
|
||||||
</p>
|
|
||||||
<pre>
|
|
||||||
/hbase/replication/rs
|
|
||||||
/hostname.example.org,6020,1234
|
|
||||||
/1
|
|
||||||
23522342.23422 [VALUE: 254]
|
|
||||||
12340993.22342 [VALUE: 0]
|
|
||||||
</pre>
|
|
||||||
</section>
|
|
||||||
</section>
|
|
||||||
<section name="Configuration Parameters">
|
|
||||||
<section name="Zookeeper znode paths">
|
|
||||||
<p>
|
|
||||||
All of the base znode names are configurable through parameters:
|
|
||||||
</p>
|
|
||||||
<table border="1">
|
|
||||||
<tr>
|
|
||||||
<td><b>Parameter</b></td>
|
|
||||||
<td><b>Default Value</b></td>
|
|
||||||
</tr>
|
|
||||||
<tr>
|
|
||||||
<td>zookeeper.znode.parent</td>
|
|
||||||
<td>/hbase</td>
|
|
||||||
</tr>
|
|
||||||
<tr>
|
|
||||||
<td>zookeeper.znode.replication</td>
|
|
||||||
<td>replication</td>
|
|
||||||
</tr>
|
|
||||||
<tr>
|
|
||||||
<td>zookeeper.znode.replication.peers</td>
|
|
||||||
<td>peers</td>
|
|
||||||
</tr>
|
|
||||||
<tr>
|
|
||||||
<td>zookeeper.znode.replication.peers.state</td>
|
|
||||||
<td>peer-state</td>
|
|
||||||
</tr>
|
|
||||||
<tr>
|
|
||||||
<td>zookeeper.znode.replication.rs</td>
|
|
||||||
<td>rs</td>
|
|
||||||
</tr>
|
|
||||||
</table>
|
|
||||||
<p>
|
|
||||||
The default replication znode structure looks like the following:
|
|
||||||
</p>
|
|
||||||
<pre>
|
|
||||||
/hbase/replication/peers/{peerId}/peer-state
|
|
||||||
/hbase/replication/rs
|
|
||||||
</pre>
|
|
||||||
</section>
|
|
||||||
<section name="Other parameters">
|
|
||||||
<ul>
|
|
||||||
<li><b>hbase.replication</b> (Default: false) - Controls whether replication is enabled
|
|
||||||
or disabled for the cluster.</li>
|
|
||||||
<li><b>replication.sleep.before.failover</b> (Default: 2000) - The amount of time a failover
|
|
||||||
worker waits before attempting to replicate a dead region server’s HLog queues.</li>
|
|
||||||
<li><b>replication.executor.workers</b> (Default: 1) - The number of dead region servers
|
|
||||||
one region server should attempt to failover simultaneously.</li>
|
|
||||||
</ul>
|
|
||||||
</section>
|
|
||||||
</section>
|
|
||||||
<section name="Choosing region servers to replicate to">
|
|
||||||
<p>
|
|
||||||
When a master cluster RS initiates a replication source to a slave cluster,
|
|
||||||
it first connects to the slave's ZooKeeper ensemble using the provided
|
|
||||||
cluster key (that key is composed of the value of hbase.zookeeper.quorum,
|
|
||||||
zookeeper.znode.parent and hbase.zookeeper.property.clientPort). It
|
|
||||||
then scans the "rs" directory to discover all the available sinks
|
|
||||||
(region servers that are accepting incoming streams of edits to replicate)
|
|
||||||
and will randomly choose a subset of them using a configured
|
|
||||||
ratio (which has a default value of 10%). For example, if a slave
|
|
||||||
cluster has 150 machines, 15 will be chosen as potential recipient for
|
|
||||||
edits that this master cluster RS will be sending. Since this is done by all
|
|
||||||
master cluster RSs, the probability that all slave RSs are used is very high,
|
|
||||||
and this method works for clusters of any size. For example, a master cluster
|
|
||||||
of 10 machines replicating to a slave cluster of 5 machines with a ratio
|
|
||||||
of 10% means that the master cluster RSs will choose one machine each
|
|
||||||
at random, thus the chance of overlapping and full usage of the slave
|
|
||||||
cluster is higher.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
A ZK watcher is placed on the ${zookeeper.znode.parent}/rs node of
|
|
||||||
the slave cluster by each of the master cluster's region servers.
|
|
||||||
This watch is used to monitor changes in the composition of the
|
|
||||||
slave cluster. When nodes are removed from the slave cluster (or
|
|
||||||
if nodes go down and/or come back up), the master cluster's region
|
|
||||||
servers will respond by selecting a new pool of slave region servers
|
|
||||||
to replicate to.
|
|
||||||
</p>
|
|
||||||
</section>
|
|
||||||
<section name="Keeping track of logs">
|
|
||||||
<p>
|
|
||||||
Every master cluster RS has its own znode in the replication znodes hierarchy.
|
|
||||||
It contains one znode per peer cluster (if 5 slave clusters, 5 znodes
|
|
||||||
are created), and each of these contain a queue
|
|
||||||
of HLogs to process. Each of these queues will track the HLogs created
|
|
||||||
by that RS, but they can differ in size. For example, if one slave
|
|
||||||
cluster becomes unavailable for some time then the HLogs should not be deleted,
|
|
||||||
thus they need to stay in the queue (while the others are processed).
|
|
||||||
See the section named "Region server failover" for an example.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
When a source is instantiated, it contains the current HLog that the
|
|
||||||
region server is writing to. During log rolling, the new file is added
|
|
||||||
to the queue of each slave cluster's znode just before it's made available.
|
|
||||||
This ensures that all the sources are aware that a new log exists
|
|
||||||
before HLog is able to append edits into it, but this operations is
|
|
||||||
now more expensive.
|
|
||||||
The queue items are discarded when the replication thread cannot read
|
|
||||||
more entries from a file (because it reached the end of the last block)
|
|
||||||
and that there are other files in the queue.
|
|
||||||
This means that if a source is up-to-date and replicates from the log
|
|
||||||
that the region server writes to, reading up to the "end" of the
|
|
||||||
current file won't delete the item in the queue.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
When a log is archived (because it's not used anymore or because there's
|
|
||||||
too many of them per hbase.regionserver.maxlogs typically because insertion
|
|
||||||
rate is faster than region flushing), it will notify the source threads that the path
|
|
||||||
for that log changed. If the a particular source was already done with
|
|
||||||
it, it will just ignore the message. If it's in the queue, the path
|
|
||||||
will be updated in memory. If the log is currently being replicated,
|
|
||||||
the change will be done atomically so that the reader doesn't try to
|
|
||||||
open the file when it's already moved. Also, moving a file is a NameNode
|
|
||||||
operation so, if the reader is currently reading the log, it won't
|
|
||||||
generate any exception.
|
|
||||||
</p>
|
|
||||||
</section>
|
|
||||||
<section name="Reading, filtering and sending edits">
|
|
||||||
<p>
|
|
||||||
By default, a source will try to read from a log file and ship log
|
|
||||||
entries as fast as possible to a sink. This is first limited by the
|
|
||||||
filtering of log entries; only KeyValues that are scoped GLOBAL and
|
|
||||||
that don't belong to catalog tables will be retained. A second limit
|
|
||||||
is imposed on the total size of the list of edits to replicate per slave,
|
|
||||||
which by default is 64MB. This means that a master cluster RS with 3 slaves
|
|
||||||
will use at most 192MB to store data to replicate. This doesn't account
|
|
||||||
the data filtered that wasn't garbage collected.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
Once the maximum size of edits was buffered or the reader hits the end
|
|
||||||
of the log file, the source thread will stop reading and will choose
|
|
||||||
at random a sink to replicate to (from the list that was generated by
|
|
||||||
keeping only a subset of slave RSs). It will directly issue a RPC to
|
|
||||||
the chosen machine and will wait for the method to return. If it's
|
|
||||||
successful, the source will determine if the current file is emptied
|
|
||||||
or if it should continue to read from it. If the former, it will delete
|
|
||||||
the znode in the queue. If the latter, it will register the new offset
|
|
||||||
in the log's znode. If the RPC threw an exception, the source will retry
|
|
||||||
10 times until trying to find a different sink.
|
|
||||||
</p>
|
|
||||||
</section>
|
|
||||||
<section name="Cleaning logs">
|
|
||||||
<p>
|
|
||||||
If replication isn't enabled, the master's logs cleaning thread will
|
|
||||||
delete old logs using a configured TTL. This doesn't work well with
|
|
||||||
replication since archived logs passed their TTL may still be in a
|
|
||||||
queue. Thus, the default behavior is augmented so that if a log is
|
|
||||||
passed its TTL, the cleaning thread will lookup every queue until it
|
|
||||||
finds the log (while caching the ones it finds). If it's not found,
|
|
||||||
the log will be deleted. The next time it has to look for a log,
|
|
||||||
it will first use its cache.
|
|
||||||
</p>
|
|
||||||
</section>
|
|
||||||
<section name="Region server failover">
|
|
||||||
<p>
|
|
||||||
As long as region servers don't fail, keeping track of the logs in ZK
|
|
||||||
doesn't add any value. Unfortunately, they do fail, so since ZooKeeper
|
|
||||||
is highly available we can count on it and its semantics to help us
|
|
||||||
managing the transfer of the queues.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
All the master cluster RSs keep a watcher on every other one of them to be
|
|
||||||
notified when one dies (just like the master does). When it happens,
|
|
||||||
they all race to create a znode called "lock" inside the dead RS' znode
|
|
||||||
that contains its queues. The one that creates it successfully will
|
|
||||||
proceed by transferring all the queues to its own znode (one by one
|
|
||||||
since ZK doesn't support the rename operation) and will delete all the
|
|
||||||
old ones when it's done. The recovered queues' znodes will be named
|
|
||||||
with the id of the slave cluster appended with the name of the dead
|
|
||||||
server.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
Once that is done, the master cluster RS will create one new source thread per
|
|
||||||
copied queue, and each of them will follow the read/filter/ship pattern.
|
|
||||||
The main difference is that those queues will never have new data since
|
|
||||||
they don't belong to their new region server, which means that when
|
|
||||||
the reader hits the end of the last log, the queue's znode will be
|
|
||||||
deleted and the master cluster RS will close that replication source.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
For example, consider a master cluster with 3 region servers that's
|
|
||||||
replicating to a single slave with id '2'. The following hierarchy
|
|
||||||
represents what the znodes layout could be at some point in time. We
|
|
||||||
can see the RSs' znodes all contain a "peers" znode that contains a
|
|
||||||
single queue. The znode names in the queues represent the actual file
|
|
||||||
names on HDFS in the form "address,port.timestamp".
|
|
||||||
</p>
|
|
||||||
<pre>
|
|
||||||
/hbase/replication/rs/
|
|
||||||
1.1.1.1,60020,123456780/
|
|
||||||
2/
|
|
||||||
1.1.1.1,60020.1234 (Contains a position)
|
|
||||||
1.1.1.1,60020.1265
|
|
||||||
1.1.1.2,60020,123456790/
|
|
||||||
2/
|
|
||||||
1.1.1.2,60020.1214 (Contains a position)
|
|
||||||
1.1.1.2,60020.1248
|
|
||||||
1.1.1.2,60020.1312
|
|
||||||
1.1.1.3,60020, 123456630/
|
|
||||||
2/
|
|
||||||
1.1.1.3,60020.1280 (Contains a position)
|
|
||||||
</pre>
|
|
||||||
<p>
|
|
||||||
Now let's say that 1.1.1.2 loses its ZK session. The survivors will race
|
|
||||||
to create a lock, and for some reasons 1.1.1.3 wins. It will then start
|
|
||||||
transferring all the queues to its local peers znode by appending the
|
|
||||||
name of the dead server. Right before 1.1.1.3 is able to clean up the
|
|
||||||
old znodes, the layout will look like the following:
|
|
||||||
</p>
|
|
||||||
<pre>
|
|
||||||
/hbase/replication/rs/
|
|
||||||
1.1.1.1,60020,123456780/
|
|
||||||
2/
|
|
||||||
1.1.1.1,60020.1234 (Contains a position)
|
|
||||||
1.1.1.1,60020.1265
|
|
||||||
1.1.1.2,60020,123456790/
|
|
||||||
lock
|
|
||||||
2/
|
|
||||||
1.1.1.2,60020.1214 (Contains a position)
|
|
||||||
1.1.1.2,60020.1248
|
|
||||||
1.1.1.2,60020.1312
|
|
||||||
1.1.1.3,60020,123456630/
|
|
||||||
2/
|
|
||||||
1.1.1.3,60020.1280 (Contains a position)
|
|
||||||
|
|
||||||
2-1.1.1.2,60020,123456790/
|
|
||||||
1.1.1.2,60020.1214 (Contains a position)
|
|
||||||
1.1.1.2,60020.1248
|
|
||||||
1.1.1.2,60020.1312
|
|
||||||
</pre>
|
|
||||||
<p>
|
|
||||||
Some time later, but before 1.1.1.3 is able to finish replicating the
|
|
||||||
last HLog from 1.1.1.2, let's say that it dies too (also some new logs
|
|
||||||
were created in the normal queues). The last RS will then try to lock
|
|
||||||
1.1.1.3's znode and will begin transferring all the queues. The new
|
|
||||||
layout will be:
|
|
||||||
</p>
|
|
||||||
<pre>
|
|
||||||
/hbase/replication/rs/
|
|
||||||
1.1.1.1,60020,123456780/
|
|
||||||
2/
|
|
||||||
1.1.1.1,60020.1378 (Contains a position)
|
|
||||||
|
|
||||||
2-1.1.1.3,60020,123456630/
|
|
||||||
1.1.1.3,60020.1325 (Contains a position)
|
|
||||||
1.1.1.3,60020.1401
|
|
||||||
|
|
||||||
2-1.1.1.2,60020,123456790-1.1.1.3,60020,123456630/
|
|
||||||
1.1.1.2,60020.1312 (Contains a position)
|
|
||||||
1.1.1.3,60020,123456630/
|
|
||||||
lock
|
|
||||||
2/
|
|
||||||
1.1.1.3,60020.1325 (Contains a position)
|
|
||||||
1.1.1.3,60020.1401
|
|
||||||
|
|
||||||
2-1.1.1.2,60020,123456790/
|
|
||||||
1.1.1.2,60020.1312 (Contains a position)
|
|
||||||
</pre>
|
|
||||||
</section>
|
|
||||||
</section>
|
|
||||||
<section name="Replication Metrics">
|
|
||||||
Following the some useful metrics which can be used to check the replication progress:
|
|
||||||
<ul>
|
|
||||||
<li><b>source.sizeOfLogQueue:</b> number of HLogs to process (excludes the one which is being
|
|
||||||
processed) at the Replication source</li>
|
|
||||||
<li><b>source.shippedOps:</b> number of mutations shipped</li>
|
|
||||||
<li><b>source.logEditsRead:</b> number of mutations read from HLogs at the replication source</li>
|
|
||||||
<li><b>source.ageOfLastShippedOp:</b> age of last batch that was shipped by the replication source</li>
|
|
||||||
</ul>
|
|
||||||
Please note that the above metrics are at the global level at this regionserver. In 0.95.0 and onwards, these
|
|
||||||
metrics are also exposed per peer level.
|
|
||||||
</section>
|
|
||||||
|
|
||||||
<section name="FAQ">
|
|
||||||
<section name="GLOBAL means replicate? Any provision to replicate only to cluster X and not to cluster Y? or is that for later?">
|
|
||||||
<p>
|
|
||||||
Yes, this is for much later.
|
|
||||||
</p>
|
|
||||||
</section>
|
|
||||||
<section name="You need a bulk edit shipper? Something that allows you transfer 64MB of edits in one go?">
|
|
||||||
<p>
|
|
||||||
You can use the HBase-provided utility called CopyTable from the package
|
|
||||||
org.apache.hadoop.hbase.mapreduce in order to have a discp-like tool to
|
|
||||||
bulk copy data.
|
|
||||||
</p>
|
|
||||||
</section>
|
|
||||||
<section name="Is it a mistake that WALEdit doesn't carry Put and Delete objects, that we have to reinstantiate not only when replicating but when replaying edits also?">
|
|
||||||
<p>
|
|
||||||
Yes, this behavior would help a lot but it's not currently available
|
|
||||||
in HBase (BatchUpdate had that, but it was lost in the new API).
|
|
||||||
</p>
|
|
||||||
</section>
|
|
||||||
<section name="Is there an issue replicating on Hadoop 1.0/1.1 when short-circuit reads are enabled?">
|
|
||||||
<p>
|
|
||||||
Yes. See <a href="https://issues.apache.org/jira/browse/HDFS-2757">HDFS-2757</a>.
|
|
||||||
</p>
|
|
||||||
</section>
|
|
||||||
</section>
|
|
||||||
</body>
|
</body>
|
||||||
</document>
|
</document>
|
||||||
|
|
Loading…
Reference in New Issue