hbase-5503. performance.xml, troubleshooting.xml - adding Troubleshooting case study

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1295915 13f79535-47bb-0310-9956-ffa450edef68
2012-03-01 21:40:34 +00:00 · 2012-03-01 21:40:34 +00:00 · 0c4e1dfa75
parent 7147bc5424
commit 0c4e1dfa75
2 changed files with 139 additions and 2 deletions
--- a/src/docbkx/performance.xml
+++ b/src/docbkx/performance.xml
@ -93,9 +93,14 @@
      <para>Using 10Gbe links between racks will greatly increase performance, and assuming your switches support a 10Gbe uplink or allow for an expansion card will allow you to
      save your ports for machines as opposed to uplinks.
      </para>
-      
+    </section>
    <section xml:id="perf.network.ints">
      <title>Network Interfaces</title>
      <para>Are all the network interfaces functioning correctly?  Are you sure?  See the Troubleshooting Case Study in <xref linkend="trouble.casestudy"/>.
      </para>
    </section>
  </section>  <!-- network -->
  <section xml:id="jvm">
    <title>Java</title>
@ -424,6 +429,13 @@ Deferred log flush can be configured on tables via <link
      in the input scan because attribute over-selection is a non-trivial performance penalty over large datasets.
      </para>
    </section>
    <section xml:id="perf.hbase.mr.input">
        <title>MapReduce - Input Splits</title>
        <para>For MapReduce jobs that use HBase tables as a source, if there is a pattern where the "slow" map tasks seem to 
        have the same Input Split (i.e., the RegionServer serving the data), see the 
        Troubleshooting Case Study in <xref linkend="trouble.casestudy"/>.
        </para>
    </section>
    <section xml:id="perf.hbase.client.scannerclose">
      <title>Close ResultScanners</title>
--- a/src/docbkx/troubleshooting.xml
+++ b/src/docbkx/troubleshooting.xml
@ -709,6 +709,12 @@ Caused by: java.io.FileNotFoundException: File _partition.lst does not exist.
        <para>HBase expects the loopback IP Address to be 127.0.0.1.  See the Getting Started section on <xref linkend="loopback.ip" />.
        </para>
       </section>
      <section xml:id="trouble.network.ints">
        <title>Network Interfaces</title>
        <para>Are all the network interfaces functioning correctly?  Are you sure?  See the Troubleshooting Case Study in <xref linkend="trouble.casestudy"/>.
        </para>
      </section>
    </section>
    <section xml:id="trouble.rs">
@ -1020,5 +1026,124 @@ in your Hadoop's <filename>lib</filename> directory.  That should fix the above
 </para>
 </section>
 </section>
-    
+
    <section xml:id="trouble.casestudy">
      <title>Case Study</title>
      <para>The issue described in this case study is somewhat exotic, but the thought process should 
      provide a useful blueprint for diagnosing cluster issues.</para>
      <section><title>Scenario</title>
        <para>Following a scheduled reboot, one data node began exhibiting unusual behavior.  Routine MapReduce 
         jobs run against HBase tables, which regularly completed in five or six minutes, began taking 30 or 40 minutes 
         to finish. These jobs were consistently found to be waiting on map and reduce tasks assigned to the troubled data node 
         (e.g., the slow map tasks all had the same Input Split).
         The situation came to a head during a distributed copy, which was severely prolonged by the lagging node.
 		</para>
       </section>
      <section><title>Hardware</title>
        <para>Datanodes:
        <itemizedlist>
          <listitem>Two 12-core processors</listitem>
          <listitem>Six Enterprise SATA disks</listitem>
          <listitem>24GB of RAM</listitem>
          <listitem>Two bonded gigabit NICs</listitem>
        </itemizedlist>
        </para>		
        <para>Network:
        <itemizedlist>
          <listitem>10 Gigabit top-of-rack switches</listitem>
          <listitem>20 Gigabit bonded interconnects between racks</listitem>
        </itemizedlist>
        </para>
      </section>
      <section><title>Hypotheses</title>
 		<section><title>HBase "Hot Spot" Region</title>
 		  <para>We hypothesized that we were experiencing a familiar point of pain: a "hot spot" region in an HBase table, 
 		  where uneven key-space distribution can funnel a huge number of requests to a single HBase region, bombarding the RegionServer 
 		  process and causing slow response times. Examination of the HBase Master status page showed that the number of HBase requests to the 
 		  troubled node was almost zero.  Further, examination of the HBase logs showed that there were no region splits, compactions, or other region transitions 
 		  in progress.  This effectively ruled out a "hot spot" as the root cause of the observed slowness.
          </para>		
        </section>
 		<section><title>HBase Region With Non-Local Data</title>
 		  <para>Our next hypothesis was that one of the MapReduce tasks was requesting data from HBase that was not local to the datanode, thus 
 		  forcing HDFS to request data blocks from other servers over the network.  Examination of the datanode logs showed that there were very 
 		  few blocks being requested over the network, indicating that the HBase region was correctly assigned, and that the majority of the necessary 
 		  data was located on the node. This ruled out the possibility of non-local data causing a slowdown.
          </para>
        </section>		
 		<section><title>Excessive I/O Wait Due To Swapping Or An Over-Worked Or Failing Hard Disk</title>
          <para>After concluding that Hadoop and HBase were not likely to be the culprits, we moved on to troubleshooting the datanode's hardware. 
          Java, by design, will periodically scan its entire memory space to do garbage collection.  If system memory is heavily overcommitted, the Linux 
          kernel may enter a vicious cycle, using up all of its resources swapping Java heap back and forth from disk to RAM as Java tries to run garbage 
          collection.  Further, a failing hard disk will often retry reads and/or writes many times before giving up and returning an error. This can manifest 
          as high iowait, as running processes wait for reads and writes to complete.  Finally, a disk nearing the upper edge of its performance envelope will 
          begin to cause iowait as it informs the kernel that it cannot accept any more data, and the kernel queues incoming data into the dirty write pool in memory.  
          However, using <code>vmstat(1)</code> and <code>free(1)</code>, we could see that no swap was being used, and the amount of disk I/O was only a few kilobytes per second.
          </para>		
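          <para>The swap check can also be scripted.  The following is a minimal sketch, not the procedure used in this case study (which relied on 
          <code>vmstat(1)</code> and <code>free(1)</code>).  It assumes a Linux host whose <filename>/proc/meminfo</filename> reports the 
          <code>SwapTotal</code> and <code>SwapFree</code> fields in kB; the function name <code>swap_used_kb</code> is hypothetical.
          </para>
<programlisting>
```python
def swap_used_kb(meminfo_text):
    """Return the amount of swap in use, in kB, from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        # Each line looks like "SwapTotal:       2097148 kB".
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    # Assumption: both fields are present, as they are on a typical Linux kernel.
    return values["SwapTotal"] - values["SwapFree"]
```
</programlisting>
          <para>On the node under investigation, the function could be fed the contents of <filename>/proc/meminfo</filename>; a result of zero 
          would agree with the observation that no swap was in use.
          </para>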
        </section>
 		<section><title>Slowness Due To High Processor Usage</title>
          <para>Next, we checked to see whether the system was performing slowly simply due to very high computational load.  <code>top(1)</code> showed that the system load 
          was higher than normal, but <code>vmstat(1)</code> and <code>mpstat(1)</code> showed that the amount of processor time spent on actual computation was low.
          </para>	
        </section>	
 		<section><title>Network Saturation (The Winner)</title>
          <para>Since neither the disks nor the processors were being utilized heavily, we moved on to the performance of the network interfaces.  The datanode had two 
          gigabit ethernet adapters, bonded to form an active-standby interface.  <code>ifconfig(8)</code> showed some anomalies, namely interface errors, overruns, and framing errors. 
          While not unheard of, these kinds of errors are exceedingly rare on modern hardware that is operating correctly:
 <programlisting>		
 $ /sbin/ifconfig bond0
 bond0  Link encap:Ethernet  HWaddr 00:00:00:00:00:00  
 inet addr:10.x.x.x  Bcast:10.x.x.255  Mask:255.255.255.0
 UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
 RX packets:2990700159 errors:12 dropped:0 overruns:1 frame:6          &lt;--- Look Here! Errors!
 TX packets:3443518196 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:0 
 RX bytes:2416328868676 (2.4 TB)  TX bytes:3464991094001 (3.4 TB)
 </programlisting>
          </para>		
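          <para>When many nodes need checking, the RX counters can be extracted with a short script.  This is a sketch under stated assumptions: 
          it expects the older net-tools <code>ifconfig(8)</code> layout shown in the listing above (other versions format the line differently), 
          and the function name <code>rx_error_counters</code> is hypothetical.
          </para>
<programlisting>
```python
import re


def rx_error_counters(ifconfig_text):
    """Return the non-zero RX error counters found in ifconfig output."""
    # Assumption: the RX line uses the "name:value" layout from the listing above.
    match = re.search(
        r"RX packets:\d+ errors:(\d+) dropped:(\d+) overruns:(\d+) frame:(\d+)",
        ifconfig_text,
    )
    if match is None:
        raise ValueError("no RX counter line found")
    names = ("errors", "dropped", "overruns", "frame")
    counters = dict(zip(names, map(int, match.groups())))
    # Keep only the counters that are worth a second look.
    return {name: count for name, count in counters.items() if count != 0}
```
</programlisting>
          <para>An empty result means the interface reported no RX errors; any other result names the counters to investigate.
          </para>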
          <para>The interface errors immediately led us to suspect that one or more of the ethernet interfaces might have negotiated the wrong line speed.  This was confirmed both by running an ICMP ping 
          from an external host and observing round-trip times in excess of 700ms, and by running <code>ethtool(8)</code> on the members of the bond interface and discovering that the active interface 
          was operating at 100Mb/s, full duplex.
 <programlisting>		
 $ sudo ethtool eth0
 Settings for eth0:
 Supported ports: [ TP ]
 Supported link modes:   10baseT/Half 10baseT/Full 
                       100baseT/Half 100baseT/Full 
                       1000baseT/Full 
 Supports auto-negotiation: Yes
 Advertised link modes:  10baseT/Half 10baseT/Full 
                       100baseT/Half 100baseT/Full 
                       1000baseT/Full 
 Advertised pause frame use: No
 Advertised auto-negotiation: Yes
 Link partner advertised link modes:  Not reported
 Link partner advertised pause frame use: No
 Link partner advertised auto-negotiation: No
 Speed: 100Mb/s                                     &lt;--- Look Here!  Should say 1000Mb/s!
 Duplex: Full
 Port: Twisted Pair
 PHYAD: 1
 Transceiver: internal
 Auto-negotiation: on
 MDI-X: Unknown
 Supports Wake-on: umbg
 Wake-on: g
 Current message level: 0x00000003 (3)
 Link detected: yes
 </programlisting>		
 		  </para>
 		  <para>In normal operation, the ICMP ping round-trip time should be around 20ms, and the interface speed and duplex should read "1000Mb/s" and "Full", respectively.  
 		  </para>
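 		  <para>The comparison against the expected speed and duplex can be automated.  The sketch below is illustrative only: it assumes 
 		  <code>ethtool(8)</code> output in the "Name: value" layout shown above, and the function name <code>link_problems</code> and its 
 		  default expectations are hypothetical choices for a gigabit link.
 		  </para>
<programlisting>
```python
def link_problems(ethtool_text, expected_speed="1000Mb/s", expected_duplex="Full"):
    """Return a list of mismatches between ethtool output and the expected link settings."""
    fields = {}
    for line in ethtool_text.splitlines():
        # Lines look like "Speed: 100Mb/s"; continuation lines without a colon are skipped.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    problems = []
    for name, expected in (("Speed", expected_speed), ("Duplex", expected_duplex)):
        actual = fields.get(name, "missing")
        if actual != expected:
            problems.append(f"{name} is {actual}, expected {expected}")
    return problems
```
</programlisting>
 		  <para>Run against the output captured for each member of the bond, an empty list indicates that the link negotiated the expected settings.
 		  </para>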
 	    </section>
     </section>  
   	<section><title>Resolution</title>
   	  <para>After determining that the active ethernet adapter was running at the incorrect speed, we used the <code>ifenslave(8)</code> command to make the standby interface 
   	  the active interface, which yielded an immediate improvement in MapReduce performance and a tenfold improvement in network throughput.
 	  </para>
 	  <para>On the next trip to the datacenter, we determined that the line speed issue was ultimately caused by a bad network cable, which was replaced.
 	  </para>
 	</section>
   </section>  <!--  case study -->
  </chapter>