From db9cb9ca08985ab0b12901abf66609caaac923bf Mon Sep 17 00:00:00 2001 From: Michael Stack Date: Mon, 2 Jun 2014 09:52:01 -0700 Subject: [PATCH] HBASE-6701 Revisit thrust of paragraph on splitting (Misty Stanley-Jones) --- src/main/docbkx/configuration.xml | 177 ++++++++++++++---------------- 1 file changed, 85 insertions(+), 92 deletions(-) diff --git a/src/main/docbkx/configuration.xml b/src/main/docbkx/configuration.xml index 2fbc3e83b07..5e6bdbb88b3 100644 --- a/src/main/docbkx/configuration.xml +++ b/src/main/docbkx/configuration.xml @@ -1194,7 +1194,7 @@ index e70ebc6..96f8c27 100644 xml:id="recommended_configurations.zk"> ZooKeeper Configuration
+ xml:id="sect.zookeeper.session.timeout"> <varname>zookeeper.session.timeout</varname> The default timeout is three minutes (specified in milliseconds). This means that if a server crashes, it will be three minutes before the Master notices the crash and @@ -1295,41 +1295,52 @@ index e70ebc6..96f8c27 100644
Managed Splitting - Rather than let HBase auto-split your Regions, manage the splitting manually - What follows is taken from the javadoc at the head of the - org.apache.hadoop.hbase.util.RegionSplitter tool added to - HBase post-0.90.0 release. - . With growing amounts of data, splits will continually be needed. Since you - always know exactly what regions you have, long-term debugging and profiling is much - easier with manual splits. It is hard to trace the logs to understand region level - problems if it keeps splitting and getting renamed. Data offlining bugs + unknown number - of split regions == oh crap! If an HLog or - StoreFile was mistakenly unprocessed by HBase due to a weird bug - and you notice it a day or so later, you can be assured that the regions specified in - these files are the same as the current regions and you have less headaches trying to - restore/replay your data. You can finely tune your compaction algorithm. With roughly - uniform data growth, it's easy to cause split / compaction storms as the regions all - roughly hit the same data size at the same time. With manual splits, you can let - staggered, time-based major compactions spread out your network IO load. - How do I turn off automatic splitting? Automatic splitting is determined by the - configuration value hbase.hregion.max.filesize. It is not recommended that - you set this to Long.MAX_VALUE in case you forget about manual splits. - A suggested setting is 100GB, which would result in > 1hr major compactions if reached. - What's the optimal number of pre-split regions to create? Mileage will vary depending - upon your application. You could start low with 10 pre-split regions / server and watch as - data grows over time. It's better to err on the side of too little regions and rolling - split later. A more complicated answer is that this depends upon the largest storefile in - your region. With a growing data size, this will get larger over time. You want the - largest region to be just big enough that the Store compact - selection algorithm only compacts it due to a timed major. If you don't, your cluster can - be prone to compaction storms as the algorithm decides to run major compactions on a large - series of regions all at once. Note that compaction storms are due to the uniform data - growth, not the manual split decision. - If you pre-split your regions too thin, you can increase the major compaction - interval by configuring HConstants.MAJOR_COMPACTION_PERIOD. If your - data size grows too large, use the (post-0.90.0 HBase) - org.apache.hadoop.hbase.util.RegionSplitter script to perform a - network IO safe rolling split of all regions. + HBase generally handles splitting your regions, based upon the settings in your + hbase-default.xml and hbase-site.xml + configuration files. Important settings include + hbase.regionserver.region.split.policy, + hbase.hregion.max.filesize, and + hbase.regionserver.regionSplitLimit. A simplistic view of splitting + is that when a region grows to hbase.hregion.max.filesize, it is split. + For most use patterns, you should use automatic splitting. + Instead of allowing HBase to split your regions automatically, you can choose to + manage the splitting yourself. This feature was added in HBase 0.90.0. Manually managing + splits works if you know your keyspace well; otherwise, let HBase figure out where to split for you. + Manual splitting can mitigate region creation and movement under load.
It also makes it so that + region boundaries are known and invariant (if you disable region splitting). If you use manual + splits, it is easier to do staggered, time-based major compactions to spread out your network IO + load. + + + Disable Automatic Splitting + To disable automatic splitting, set hbase.hregion.max.filesize to + a very large value, such as 100 GB. It is not recommended to set it to + its absolute maximum value of Long.MAX_VALUE. A configuration sketch appears at + the end of this section. + + + Automatic Splitting Is Recommended + If you disable automatic splits to diagnose a problem or during a period of fast + data growth, it is recommended to re-enable them when your situation becomes more + stable. The potential benefits of managing region splits yourself are + disputed. + + + Determine the Optimal Number of Pre-Split Regions + The optimal number of pre-split regions depends on your application and environment. + A good rule of thumb is to start with 10 pre-split regions per server and watch as data + grows over time. It is better to err on the side of too few regions and perform rolling + splits later. The optimal number of regions depends upon the largest StoreFile in your + region. The size of the largest StoreFile will increase with time if the amount of data + grows. The goal is for the largest region to be just large enough that the compaction + selection algorithm only compacts it during a timed major compaction. Otherwise, the + cluster can be prone to compaction storms where a large number of regions are under + compaction at the same time. It is important to understand that the uniform data growth + causes compaction storms, not the manual split decision. + + If your regions are split too thin (into too many small regions), you can increase the major + compaction interval by configuring HConstants.MAJOR_COMPACTION_PERIOD. + HBase 0.90 introduced org.apache.hadoop.hbase.util.RegionSplitter, + which provides a network-IO-safe rolling split of all regions.
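+ A minimal hbase-site.xml sketch of disabling automatic
+ splitting (the property names are the ones discussed above; the
+ ConstantSizeRegionSplitPolicy class name and the concrete values are
+ illustrative assumptions to verify against your HBase version):
+ <programlisting><![CDATA[
+ <!-- Raise the split threshold to 100 GB (expressed in bytes) so
+      regions effectively never split on their own. -->
+ <property>
+   <name>hbase.hregion.max.filesize</name>
+   <value>107374182400</value>
+ </property>
+ <!-- Keep the purely size-based split policy so the threshold above
+      is the only automatic trigger. -->
+ <property>
+   <name>hbase.regionserver.region.split.policy</name>
+   <value>org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy</value>
+ </property>
+ ]]></programlisting>
+ You can then split on your own schedule, for example with the
+ RegionSplitter tool mentioned above; consult its usage output for the
+ arguments your release supports.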
@@ -1356,62 +1367,44 @@ index e70ebc6..96f8c27 100644 mapreduce.reduce.speculative to false.
- -
- Other Configurations -
- Balancer - The balancer is a periodic operation which is run on the master to redistribute - regions on the cluster. It is configured via hbase.balancer.period and - defaults to 300000 (5 minutes). - See for more information on the LoadBalancer. - -
-
- Disabling Blockcache - Do not turn off block cache (You'd do it by setting - hbase.block.cache.size to zero). Currently we do not do well if you - do this because the regionserver will spend all its time loading hfile indices over and - over again. If your working set it such that block cache does you no good, at least size - the block cache such that hfile indices will stay up in the cache (you can get a rough - idea on the size you need by surveying regionserver UIs; you'll see index block size - accounted near the top of the webpage). -
-
- <link - xlink:href="http://en.wikipedia.org/wiki/Nagle's_algorithm">Nagle's</link> or the small - package problem - If a big 40ms or so occasional delay is seen in operations against HBase, try the - Nagles' setting. For example, see the user mailing list thread, Inconsistent - scan performance with caching set to 1 and the issue cited therein where setting - notcpdelay improved scan speeds. You might also see the graphs on the tail of HBASE-7008 Set scanner - caching to a better default where our Lars Hofhansl tries various data sizes w/ - Nagle's on and off measuring the effect. -
-
- Better Mean Time to Recover (MTTR) - This section is about configurations that will make servers come back faster after a - fail. See the Deveraj Das an Nicolas Liochon blog post Introduction - to HBase Mean Time to Recover (MTTR) for a brief introduction. - The issue HBASE-8354 forces Namenode - into loop with lease recovery requests is messy but has a bunch of good - discussion toward the end on low timeouts and how to effect faster recovery including - citation of fixes added to HDFS. Read the Varun Sharma comments. The below suggested - configurations are Varun's suggestions distilled and tested. Make sure you are running on - a late-version HDFS so you have the fixes he refers too and himself adds to HDFS that help - HBase MTTR (e.g. HDFS-3703, HDFS-3712, and HDFS-4791 -- hadoop 2 for sure has them and - late hadoop 1 has some). Set the following in the RegionServer. - Other Configurations +
Balancer + The balancer is a periodic operation which is run on the master to redistribute regions on the cluster. It is configured via + hbase.balancer.period and defaults to 300000 (5 minutes). + See for more information on the LoadBalancer. + +
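+ A minimal hbase-site.xml sketch (the property name is from this
+ section; the 15-minute value is illustrative only):
+ <programlisting><![CDATA[
+ <!-- Run the balancer every 15 minutes instead of the default 5. -->
+ <property>
+   <name>hbase.balancer.period</name>
+   <value>900000</value>
+ </property>
+ ]]></programlisting>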
+
Disabling Blockcache + Do not turn off the block cache (you would do so by setting hbase.block.cache.size to zero). + Currently we do not do well if you do this, because the RegionServer will spend all its time loading HFile + indices over and over again. In fact, in later versions of HBase, it is not possible to disable the + block cache completely. + HBase will cache meta blocks -- the INDEX and BLOOM blocks -- even if the block cache + is disabled.
+
<link xlink:href="http://en.wikipedia.org/wiki/Nagle's_algorithm">Nagle's</link> or the small packet problem + If an occasional delay of around 40ms is seen in operations against HBase, + try the Nagle's setting. For example, see the user mailing list thread, + Inconsistent scan performance with caching set to 1 + and the issue cited therein where setting tcpnodelay improved scan speeds. You might also + see the graphs at the tail of HBASE-7008 Set scanner caching to a better default + where Lars Hofhansl tries various data sizes with Nagle's on and off, measuring the effect. +
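+ A minimal hbase-site.xml sketch of turning Nagle's algorithm off for
+ HBase RPC (the hbase.ipc.client.tcpnodelay property name is an
+ assumption drawn from the threads cited above; verify it against your
+ version's hbase-default.xml):
+ <programlisting><![CDATA[
+ <!-- Send RPC packets immediately instead of buffering small writes. -->
+ <property>
+   <name>hbase.ipc.client.tcpnodelay</name>
+   <value>true</value>
+ </property>
+ ]]></programlisting>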
+
Better Mean Time to Recover (MTTR) + This section is about configurations that will make servers come back faster after a failure. + See the Devaraj Das and Nicolas Liochon blog post + Introduction to HBase Mean Time to Recover (MTTR) + for a brief introduction. + The issue HBASE-8354 forces Namenode into loop with lease recovery requests + is messy but has a bunch of good discussion toward the end on low timeouts and how to effect faster recovery, including citation of fixes + added to HDFS. Read the Varun Sharma comments. The suggested configurations below are Varun's suggestions distilled and tested. Make sure you are + running on a late-version HDFS so you have the fixes he refers to and himself added to HDFS that help HBase MTTR + (e.g. HDFS-3703, HDFS-3712, and HDFS-4791 -- hadoop 2 for sure has them and late hadoop 1 has some). + Set the following in the RegionServer. +
+ <property>
+   <name>hbase.lease.recovery.dfs.timeout</name>
+   <value>23000</value>
+ </property>