From 6f65994f51f0cd95a791310fbc4d013b82b199d6 Mon Sep 17 00:00:00 2001
From: Doug Meil <dmeil@apache.org>
Date: Sun, 16 Oct 2011 14:16:31 +0000
Subject: [PATCH] HBASE-4598 book update (book.xml, perf.xml, trouble.xml)

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1184830 13f79535-47bb-0310-9956-ffa450edef68
---
 src/docbkx/book.xml            | 34 ++++++++++++++++++++++++++------
 src/docbkx/performance.xml     | 36 ++++++++++++++++++++++++++--------
 src/docbkx/troubleshooting.xml |  7 +++++++
 3 files changed, 63 insertions(+), 14 deletions(-)
diff --git a/src/docbkx/book.xml b/src/docbkx/book.xml
index 62a3514c27e..21348888c35 100644
--- a/src/docbkx/book.xml
+++ b/src/docbkx/book.xml
@@ -1316,7 +1316,7 @@ scan.setFilter(filter);
     <section xml:id="master"><title>Master</title>
        <para><code>HMaster</code> is the implementation of the Master Server.  The Master server
        is responsible for monitoring all RegionServer instances in the cluster, and is
-       the interface for all metadata changes.
+       the interface for all metadata changes.  In a distributed cluster, the Master typically runs on the <xref linkend="arch.hdfs.nn" />.
        </para>
        <section xml:id="master.startup"><title>Startup Behavior</title>
          <para>If run in a multi-Master environment, all Masters compete to run the cluster.  If the active
@@ -1352,7 +1352,8 @@ scan.setFilter(filter);
 
      </section>
      <section xml:id="regionserver.arch"><title>RegionServer</title>
-       <para><code>HRegionServer</code> is the RegionServer implementation.  It is responsible for serving and managing regions.  
+       <para><code>HRegionServer</code> is the RegionServer implementation.  It is responsible for serving and managing regions.
+       In a distributed cluster, a RegionServer runs on a <xref linkend="arch.hdfs.dn" />.  
        </para>
        <section xml:id="regionserver.arch.api"><title>Interface</title>
          <para>The methods exposed by <code>HRegionRegionInterface</code> contain both data-oriented and region-maintenance methods:
@@ -1711,6 +1712,27 @@ scan.setFilter(filter);
      </section>   <!--  bloom  -->  
      
     </section>
+    
+    <section xml:id="arch.hdfs"><title>HDFS</title>
+       <para>As HBase runs on HDFS (and each StoreFile is written as a file on HDFS),
+        it is important to have an understanding of the HDFS Architecture,
+         especially in terms of how it stores files, handles failovers, and replicates blocks.
+       </para>
+       <para>See the Hadoop documentation on <link xlink:href="http://hadoop.apache.org/common/docs/current/hdfs_design.html">HDFS Architecture</link>
+       for more information.
+       </para>
+       <section xml:id="arch.hdfs.nn"><title>NameNode</title>
+         <para>The NameNode is responsible for maintaining the filesystem metadata.  See the above HDFS Architecture link
+         for more information.
+         </para>
+       </section>
+       <section xml:id="arch.hdfs.dn"><title>DataNode</title>
+         <para>The DataNodes are responsible for storing HDFS blocks.  See the above HDFS Architecture link
+         for more information.
+         </para>
+       </section>
+    </section>       
+    
   </chapter>   <!--  architecture -->
   
   <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="external_apis.xml" />
@@ -1889,15 +1911,15 @@ hbase> describe 't1'</programlisting>
             </answer>
         </qandaentry>
     </qandadiv>
-    <qandadiv xml:id="ec2"><title>EC2</title>
+    <qandadiv xml:id="ec2"><title>Amazon EC2</title>
         <qandaentry>
             <question><para>
-            Why doesn't my remote java connection into my ec2 cluster work?
+            I am running HBase on Amazon EC2 and...
             </para></question>
             <answer>
                 <para>
-          See Andrew's answer here, up on the user list: <link xlink:href="http://search-hadoop.com/m/sPdqNFAwyg2">Remote Java client connection into EC2 instance</link>.
-                </para>
+              See the Troubleshooting <xref linkend="trouble.ec2" /> and Performance <xref linkend="perf.ec2" /> sections.
+               </para>
             </answer>
         </qandaentry>
     </qandadiv>
diff --git a/src/docbkx/performance.xml b/src/docbkx/performance.xml
index 6c0a0bbd33c..1dfd4db8a36 100644
--- a/src/docbkx/performance.xml
+++ b/src/docbkx/performance.xml
@@ -409,15 +409,35 @@ htable.close();</programlisting></para>
        </para>
      </section>
   </section>  <!--  deleting -->
+
+  <section xml:id="perf.hdfs"><title>HDFS</title>
+   <para>Because HBase runs on <xref linkend="arch.hdfs" />, it is important to understand how it works and how it affects
+   HBase.
+   </para>
+    <section xml:id="perf.hdfs.curr"><title>Current Issues With Low-Latency Reads</title>
+      <para>The original use-case for HDFS was batch processing.  As such, low-latency reads were historically not a priority.
+      With the increased adoption of HBase this is changing, and several improvements are already in development.
+      See the 
+      <link xlink:href="https://issues.apache.org/jira/browse/HDFS-1599">Umbrella Jira Ticket for HDFS Improvements for HBase</link>.
+      </para>
+    </section>
+    <section xml:id="perf.hdfs.comp"><title>Performance Comparisons of HBase vs. HDFS</title>
+     <para>A fairly common question on the dist-list is why HBase isn't as performant as HDFS files in a batch context (e.g., as 
+     a MapReduce source or sink).  The short answer is that HBase is doing a lot more than HDFS (e.g., reading the KeyValues, 
+     returning the most current row or specified timestamps, etc.), and as such HBase is 4-5 times slower than HDFS in this 
+     processing context.  Not that there isn't room for improvement (and this gap will, over time, be reduced), but HDFS
+      will always be faster in this use-case.
+     </para>
+    </section>
+  </section>
   
   <section xml:id="perf.ec2"><title>Amazon EC2</title>
-  <para>Performance questions are common on Amazon EC2 environments because it is is a shared environment.  You will
-  not see the same throughput as a dedicated server.  In terms of running tests on EC2, run them several times for the same
-  reason (i.e., it's a shared environment and you don't know what else is happening on the server).
-  </para>
-  <para>If you are running on EC2 and post performance questions on the dist-list, please state this fact up-front that
-   because EC2 issues are practically a separate class of performance issues.
-  
-  </para>
+   <para>Performance questions are common on Amazon EC2 environments because it is a shared environment.  You will
+   not see the same throughput as a dedicated server.  In terms of running tests on EC2, run them several times for the same
+   reason (i.e., it's a shared environment and you don't know what else is happening on the server).
+   </para>
+   <para>If you are running on EC2 and post performance questions on the dist-list, please state this fact up-front
+    because EC2 issues are practically a separate class of performance issues.
+   </para>
   </section>
 </chapter>
diff --git a/src/docbkx/troubleshooting.xml b/src/docbkx/troubleshooting.xml
index 1ba03fe0045..fd757d489f2 100644
--- a/src/docbkx/troubleshooting.xml
+++ b/src/docbkx/troubleshooting.xml
@@ -793,6 +793,13 @@ ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expi
              <para>Questions on HBase and Amazon EC2 come up frequently on the HBase dist-list. Search for old threads using <link xlink:href="http://search-hadoop.com/">Search Hadoop</link>
              </para>
           </section>
+          <section xml:id="trouble.ec2.connection">
+             <title>Remote Java Connection into EC2 Cluster Not Working</title>
+             <para>
+             See Andrew's answer on the user list: <link xlink:href="http://search-hadoop.com/m/sPdqNFAwyg2">Remote Java client connection into EC2 instance</link>.
+             </para>
+          </section>
+          
     </section>
     
   </chapter>