HBASE-6701 Revisit thrust of paragraph on splitting (Misty Stanley-Jones)

Michael Stack 2014-06-02 09:52:01 -07:00
parent 768c4d6775
commit db9cb9ca08
1 changed file with 85 additions and 92 deletions


@@ -1194,7 +1194,7 @@ index e70ebc6..96f8c27 100644
         xml:id="recommended_configurations.zk">
         <title>ZooKeeper Configuration</title>
         <section
-          xml:id="zookeeper.session.timeout">
+          xml:id="sect.zookeeper.session.timeout">
           <title><varname>zookeeper.session.timeout</varname></title>
           <para>The default timeout is three minutes (specified in milliseconds). This means that if
             a server crashes, it will be three minutes before the Master notices the crash and
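
For context, the renamed section covers a setting you would lower in hbase-site.xml if you want the Master to notice a downed server sooner than the three-minute default described above. A minimal sketch, assuming a cluster that can tolerate a tighter timeout (the 60000 ms value is illustrative, not from the commit):

    <property>
      <name>zookeeper.session.timeout</name>
      <!-- 1 minute instead of the 3-minute default; illustrative value -->
      <value>60000</value>
    </property>
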
@@ -1295,41 +1295,52 @@ index e70ebc6..96f8c27 100644
         <section
           xml:id="disable.splitting">
           <title>Managed Splitting</title>
-          <para> Rather than let HBase auto-split your Regions, manage the splitting manually <footnote>
-              <para>What follows is taken from the javadoc at the head of the
-                <classname>org.apache.hadoop.hbase.util.RegionSplitter</classname> tool added to
-                HBase post-0.90.0 release. </para>
-            </footnote>. With growing amounts of data, splits will continually be needed. Since you
-            always know exactly what regions you have, long-term debugging and profiling is much
-            easier with manual splits. It is hard to trace the logs to understand region level
-            problems if it keeps splitting and getting renamed. Data offlining bugs + unknown number
-            of split regions == oh crap! If an <classname>HLog</classname> or
-            <classname>StoreFile</classname> was mistakenly unprocessed by HBase due to a weird bug
-            and you notice it a day or so later, you can be assured that the regions specified in
-            these files are the same as the current regions and you have less headaches trying to
-            restore/replay your data. You can finely tune your compaction algorithm. With roughly
-            uniform data growth, it's easy to cause split / compaction storms as the regions all
-            roughly hit the same data size at the same time. With manual splits, you can let
-            staggered, time-based major compactions spread out your network IO load. </para>
-          <para> How do I turn off automatic splitting? Automatic splitting is determined by the
-            configuration value <code>hbase.hregion.max.filesize</code>. It is not recommended that
-            you set this to <varname>Long.MAX_VALUE</varname> in case you forget about manual splits.
-            A suggested setting is 100GB, which would result in > 1hr major compactions if reached. </para>
-          <para>What's the optimal number of pre-split regions to create? Mileage will vary depending
-            upon your application. You could start low with 10 pre-split regions / server and watch as
-            data grows over time. It's better to err on the side of too little regions and rolling
-            split later. A more complicated answer is that this depends upon the largest storefile in
-            your region. With a growing data size, this will get larger over time. You want the
-            largest region to be just big enough that the <classname>Store</classname> compact
-            selection algorithm only compacts it due to a timed major. If you don't, your cluster can
-            be prone to compaction storms as the algorithm decides to run major compactions on a large
-            series of regions all at once. Note that compaction storms are due to the uniform data
-            growth, not the manual split decision. </para>
-          <para> If you pre-split your regions too thin, you can increase the major compaction
-            interval by configuring <varname>HConstants.MAJOR_COMPACTION_PERIOD</varname>. If your
-            data size grows too large, use the (post-0.90.0 HBase)
-            <classname>org.apache.hadoop.hbase.util.RegionSplitter</classname> script to perform a
-            network IO safe rolling split of all regions. </para>
+          <para>HBase generally handles splitting your regions, based upon the settings in your
+            <filename>hbase-default.xml</filename> and <filename>hbase-site.xml</filename>
+            configuration files. Important settings include
+            <varname>hbase.regionserver.region.split.policy</varname>,
+            <varname>hbase.hregion.max.filesize</varname>, and
+            <varname>hbase.regionserver.regionSplitLimit</varname>. A simplistic view of splitting
+            is that when a region grows to <varname>hbase.hregion.max.filesize</varname>, it is split.
+            For most use patterns, most of the time, you should use automatic splitting.</para>
+          <para>Instead of allowing HBase to split your regions automatically, you can choose to
+            manage the splitting yourself. This feature was added in HBase 0.90.0. Manually managing
+            splits works if you know your keyspace well; otherwise, let HBase figure out where to split for you.
+            Manual splitting can mitigate region creation and movement under load. It also makes it so
+            that region boundaries are known and invariant (if you disable region splitting). If you use manual
+            splits, it is easier to do staggered, time-based major compactions to spread out your network IO
+            load.</para>
+          <formalpara>
+            <title>Disable Automatic Splitting</title>
+            <para>To disable automatic splitting, set <varname>hbase.hregion.max.filesize</varname> to
+              a very large value, such as <literal>100 GB</literal>. It is not recommended to set it to
+              its absolute maximum value of <literal>Long.MAX_VALUE</literal>.</para>
+          </formalpara>
+          <note>
+            <title>Automatic Splitting Is Recommended</title>
+            <para>If you disable automatic splits to diagnose a problem or during a period of fast
+              data growth, it is recommended to re-enable them when your situation becomes more
+              stable. The potential benefits of managing region splits yourself are not
+              undisputed.</para>
+          </note>
+          <formalpara>
+            <title>Determine the Optimal Number of Pre-Split Regions</title>
+            <para>The optimal number of pre-split regions depends on your application and environment.
+              A good rule of thumb is to start with 10 pre-split regions per server and watch as data
+              grows over time. It is better to err on the side of too few regions and perform rolling
+              splits later. The optimal number of regions depends upon the largest StoreFile in your
+              region. The size of the largest StoreFile will increase with time if the amount of data
+              grows. The goal is for the largest region to be just large enough that the compaction
+              selection algorithm only compacts it during a timed major compaction. Otherwise, the
+              cluster can be prone to compaction storms where a large number of regions are under
+              compaction at the same time. It is important to understand that the data growth causes
+              compaction storms, and not the manual split decision.</para>
+          </formalpara>
+          <para>If the regions are split too thin (too many small regions), you can increase the major
+            compaction interval by configuring <varname>HConstants.MAJOR_COMPACTION_PERIOD</varname>.
+            HBase 0.90 introduced <classname>org.apache.hadoop.hbase.util.RegionSplitter</classname>,
+            which provides a network-IO-safe rolling split of all regions.</para>
         </section>
         <section
           xml:id="managed.compactions">
@@ -1356,62 +1367,44 @@ index e70ebc6..96f8c27 100644
             <varname>mapreduce.reduce.speculative</varname> to false. </para>
         </section>
       </section>
-      <section
-        xml:id="other_configuration">
-        <title>Other Configurations</title>
-        <section
-          xml:id="balancer_config">
-          <title>Balancer</title>
-          <para>The balancer is a periodic operation which is run on the master to redistribute
-            regions on the cluster. It is configured via <varname>hbase.balancer.period</varname> and
-            defaults to 300000 (5 minutes). </para>
-          <para>See <xref
-              linkend="master.processes.loadbalancer" /> for more information on the LoadBalancer.
-          </para>
-        </section>
-        <section
-          xml:id="disabling.blockcache">
-          <title>Disabling Blockcache</title>
-          <para>Do not turn off block cache (You'd do it by setting
-            <varname>hbase.block.cache.size</varname> to zero). Currently we do not do well if you
-            do this because the regionserver will spend all its time loading hfile indices over and
-            over again. If your working set it such that block cache does you no good, at least size
-            the block cache such that hfile indices will stay up in the cache (you can get a rough
-            idea on the size you need by surveying regionserver UIs; you'll see index block size
-            accounted near the top of the webpage).</para>
-        </section>
-        <section
-          xml:id="nagles">
-          <title><link
-              xlink:href="http://en.wikipedia.org/wiki/Nagle's_algorithm">Nagle's</link> or the small
-            package problem</title>
-          <para>If a big 40ms or so occasional delay is seen in operations against HBase, try the
-            Nagles' setting. For example, see the user mailing list thread, <link
-              xlink:href="http://search-hadoop.com/m/pduLg2fydtE/Inconsistent+scan+performance+with+caching+set+&amp;subj=Re+Inconsistent+scan+performance+with+caching+set+to+1">Inconsistent
-              scan performance with caching set to 1</link> and the issue cited therein where setting
-            notcpdelay improved scan speeds. You might also see the graphs on the tail of <link
-              xlink:href="https://issues.apache.org/jira/browse/HBASE-7008">HBASE-7008 Set scanner
-              caching to a better default</link> where our Lars Hofhansl tries various data sizes w/
-            Nagle's on and off measuring the effect.</para>
-        </section>
-        <section
-          xml:id="mttr">
-          <title>Better Mean Time to Recover (MTTR)</title>
-          <para>This section is about configurations that will make servers come back faster after a
-            fail. See the Deveraj Das an Nicolas Liochon blog post <link
-              xlink:href="http://hortonworks.com/blog/introduction-to-hbase-mean-time-to-recover-mttr/">Introduction
-              to HBase Mean Time to Recover (MTTR)</link> for a brief introduction.</para>
-          <para>The issue <link
-              xlink:href="https://issues.apache.org/jira/browse/HBASE-8389">HBASE-8354 forces Namenode
-              into loop with lease recovery requests</link> is messy but has a bunch of good
-            discussion toward the end on low timeouts and how to effect faster recovery including
-            citation of fixes added to HDFS. Read the Varun Sharma comments. The below suggested
-            configurations are Varun's suggestions distilled and tested. Make sure you are running on
-            a late-version HDFS so you have the fixes he refers too and himself adds to HDFS that help
-            HBase MTTR (e.g. HDFS-3703, HDFS-3712, and HDFS-4791 -- hadoop 2 for sure has them and
-            late hadoop 1 has some). Set the following in the RegionServer. </para>
-          <programlisting><![CDATA[
-<property>
+    <section xml:id="other_configuration"><title>Other Configurations</title>
+      <section xml:id="balancer_config"><title>Balancer</title>
+        <para>The balancer is a periodic operation which is run on the master to redistribute regions on the cluster. It is configured via
+          <varname>hbase.balancer.period</varname> and defaults to 300000 (5 minutes). </para>
+        <para>See <xref linkend="master.processes.loadbalancer" /> for more information on the LoadBalancer.
+        </para>
+      </section>
+      <section xml:id="disabling.blockcache"><title>Disabling Blockcache</title>
+        <para>Do not turn off the block cache (you'd do it by setting <varname>hbase.block.cache.size</varname> to zero).
+          Currently we do not do well if you do this, because the RegionServer will spend all its time loading HFile
+          indices over and over again. In fact, in later versions of HBase, it is not possible to disable the
+          block cache completely.
+          HBase will cache meta blocks -- the INDEX and BLOOM blocks -- even if the block cache
+          is disabled.</para>
+      </section>
+      <section xml:id="nagles">
+        <title><link xlink:href="http://en.wikipedia.org/wiki/Nagle's_algorithm">Nagle's</link> or the small package problem</title>
+        <para>If a big 40ms or so occasional delay is seen in operations against HBase,
+          try the Nagle's setting. For example, see the user mailing list thread,
+          <link xlink:href="http://search-hadoop.com/m/pduLg2fydtE/Inconsistent+scan+performance+with+caching+set+&amp;subj=Re+Inconsistent+scan+performance+with+caching+set+to+1">Inconsistent scan performance with caching set to 1</link>
+          and the issue cited therein, where setting notcpdelay improved scan speeds. You might also
+          see the graphs at the tail of <link xlink:href="https://issues.apache.org/jira/browse/HBASE-7008">HBASE-7008 Set scanner caching to a better default</link>,
+          where our Lars Hofhansl tries various data sizes with Nagle's on and off, measuring the effect.</para>
+      </section>
+      <section xml:id="mttr">
+        <title>Better Mean Time to Recover (MTTR)</title>
+        <para>This section is about configurations that will make servers come back faster after a fail.
+          See the Devaraj Das and Nicolas Liochon blog post
+          <link xlink:href="http://hortonworks.com/blog/introduction-to-hbase-mean-time-to-recover-mttr/">Introduction to HBase Mean Time to Recover (MTTR)</link>
+          for a brief introduction.</para>
+        <para>The issue <link xlink:href="https://issues.apache.org/jira/browse/HBASE-8389">HBASE-8354 forces Namenode into loop with lease recovery requests</link>
+          is messy but has a bunch of good discussion toward the end on low timeouts and how to effect faster recovery, including citation of fixes
+          added to HDFS. Read the Varun Sharma comments. The below suggested configurations are Varun's suggestions distilled and tested. Make sure you are
+          running on a late-version HDFS so you have the fixes he refers to and himself adds to HDFS that help HBase MTTR
+          (e.g. HDFS-3703, HDFS-3712, and HDFS-4791 -- hadoop 2 for sure has them and late hadoop 1 has some).
+          Set the following in the RegionServer.</para>
+        <programlisting>
+<![CDATA[<property>
   <name>hbase.lease.recovery.dfs.timeout</name>
   <value>23000</value>
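
Per the Nagle's discussion above, where setting notcpdelay improved scan speeds, a hedged hbase-site.xml sketch for turning on TCP_NODELAY on both sides of the RPC path (these property names exist in HBase, but defaults and effect vary by version, so verify against yours before relying on them):

    <property>
      <name>hbase.ipc.client.tcpnodelay</name>
      <!-- disable Nagle's algorithm on client RPC connections -->
      <value>true</value>
    </property>
    <property>
      <name>hbase.ipc.server.tcpnodelay</name>
      <!-- disable Nagle's algorithm on server RPC connections -->
      <value>true</value>
    </property>
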