HBASE-4598 book update (book.xml, perf.xml, trouble.xml)

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1184830 13f79535-47bb-0310-9956-ffa450edef68
2011-10-16 14:16:31 +00:00 · 2011-10-16 14:16:31 +00:00 · 6f65994f51
parent 8adbd7f62e
commit 6f65994f51
3 changed files with 63 additions and 14 deletions
--- a/src/docbkx/book.xml
+++ b/src/docbkx/book.xml
@@ -1316,7 +1316,7 @@ scan.setFilter(filter);
    <section xml:id="master"><title>Master</title>
       <para><code>HMaster</code> is the implementation of the Master Server.  The Master server
       is responsible for monitoring all RegionServer instances in the cluster, and is
-       the interface for all metadata changes.
+       the interface for all metadata changes.  In a distributed cluster, the Master typically runs on the <xref linkend="arch.hdfs.nn" />.
       </para>
       <section xml:id="master.startup"><title>Startup Behavior</title>
         <para>If run in a multi-Master environment, all Masters compete to run the cluster.  If the active
@@ -1352,7 +1352,8 @@ scan.setFilter(filter);

     </section>
     <section xml:id="regionserver.arch"><title>RegionServer</title>
-       <para><code>HRegionServer</code> is the RegionServer implementation.  It is responsible for serving and managing regions.  
+       <para><code>HRegionServer</code> is the RegionServer implementation.  It is responsible for serving and managing regions.
+       In a distributed cluster, a RegionServer runs on a <xref linkend="arch.hdfs.dn" />.  
       </para>
       <section xml:id="regionserver.arch.api"><title>Interface</title>
         <para>The methods exposed by <code>HRegionRegionInterface</code> contain both data-oriented and region-maintenance methods:
@@ -1711,6 +1712,27 @@ scan.setFilter(filter);
     </section>   <!--  bloom  -->  
     
    </section>
+    
+    <section xml:id="arch.hdfs"><title>HDFS</title>
+       <para>As HBase runs on HDFS (and each StoreFile is written as a file on HDFS),
+        it is important to have an understanding of the HDFS Architecture,
+         especially in terms of how it stores files, handles failovers, and replicates blocks.
+       </para>
+       <para>See the Hadoop documentation on <link xlink:href="http://hadoop.apache.org/common/docs/current/hdfs_design.html">HDFS Architecture</link>
+       for more information.
+       </para>
+       <section xml:id="arch.hdfs.nn"><title>NameNode</title>
+         <para>The NameNode is responsible for maintaining the filesystem metadata.  See the above HDFS Architecture link
+         for more information.
+         </para>
+       </section>
+       <section xml:id="arch.hdfs.dn"><title>DataNode</title>
+         <para>The DataNodes are responsible for storing HDFS blocks.  See the above HDFS Architecture link
+         for more information.
+         </para>
+       </section>
+    </section>       
+    
  </chapter>   <!--  architecture -->
  
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="external_apis.xml" />
@@ -1889,15 +1911,15 @@ hbase> describe 't1'</programlisting>
            </answer>
        </qandaentry>
    </qandadiv>
-    <qandadiv xml:id="ec2"><title>EC2</title>
+    <qandadiv xml:id="ec2"><title>Amazon EC2</title>
        <qandaentry>
            <question><para>
-            Why doesn't my remote java connection into my ec2 cluster work?
+            I am running HBase on Amazon EC2 and...
            </para></question>
            <answer>
                <para>
-          See Andrew's answer here, up on the user list: <link xlink:href="http://search-hadoop.com/m/sPdqNFAwyg2">Remote Java client connection into EC2 instance</link>.
-                </para>
+                  See the Troubleshooting <xref linkend="trouble.ec2" /> and Performance <xref linkend="perf.ec2" /> sections.
+               </para>
            </answer>
        </qandaentry>
    </qandadiv>
--- a/src/docbkx/performance.xml
+++ b/src/docbkx/performance.xml
@@ -409,15 +409,35 @@ htable.close();</programlisting></para>
       </para>
     </section>
  </section>  <!--  deleting -->
+
+  <section xml:id="perf.hdfs"><title>HDFS</title>
+   <para>Because HBase runs on <xref linkend="arch.hdfs" />, it is important to understand how it works and how it affects
+   HBase.
+   </para>
+    <section xml:id="perf.hdfs.curr"><title>Current Issues With Low-Latency Reads</title>
+      <para>The original use-case for HDFS was batch processing.  As such, low-latency reads were historically not a priority.
+      With the increased adoption of HBase this is changing, and several improvements are already in development.
+      See the 
+      <link xlink:href="https://issues.apache.org/jira/browse/HDFS-1599">Umbrella Jira Ticket for HDFS Improvements for HBase</link>.
+      </para>
+    </section>
+    <section xml:id="perf.hdfs.comp"><title>Performance Comparisons of HBase vs. HDFS</title>
+     <para>A fairly common question on the dist-list is why HBase isn't as performant as HDFS files in a batch context (e.g., as 
+     a MapReduce source or sink).  The short answer is that HBase is doing a lot more than HDFS (e.g., reading the KeyValues, 
+     returning the most current row or specified timestamps, etc.), and as such HBase is 4-5 times slower than HDFS in this 
+     processing context.  There is room for improvement, and this gap will be reduced over time, but HDFS
+      will always be faster in this use-case.
+     </para>
+    </section>
+  </section>
  
  <section xml:id="perf.ec2"><title>Amazon EC2</title>
-  <para>Performance questions are common on Amazon EC2 environments because it is is a shared environment.  You will
-  not see the same throughput as a dedicated server.  In terms of running tests on EC2, run them several times for the same
-  reason (i.e., it's a shared environment and you don't know what else is happening on the server).
-  </para>
-  <para>If you are running on EC2 and post performance questions on the dist-list, please state this fact up-front that
-   because EC2 issues are practically a separate class of performance issues.
-  
-  </para>
+   <para>Performance questions are common on Amazon EC2 environments because it is a shared environment.  You will
+   not see the same throughput as a dedicated server.  In terms of running tests on EC2, run them several times for the same
+   reason (i.e., it's a shared environment and you don't know what else is happening on the server).
+   </para>
+   <para>If you are running on EC2 and post performance questions on the dist-list, please state this fact up-front,
+    because EC2 issues are practically a separate class of performance issues.
+   </para>
  </section>
 </chapter>
--- a/src/docbkx/troubleshooting.xml
+++ b/src/docbkx/troubleshooting.xml
@@ -793,6 +793,13 @@ ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expi
             <para>Questions on HBase and Amazon EC2 come up frequently on the HBase dist-list. Search for old threads using <link xlink:href="http://search-hadoop.com/">Search Hadoop</link>
             </para>
          </section>
+          <section xml:id="trouble.ec2.connection">
+             <title>Remote Java Connection into EC2 Cluster Not Working</title>
+             <para>
+             See Andrew's answer here, up on the user list: <link xlink:href="http://search-hadoop.com/m/sPdqNFAwyg2">Remote Java client connection into EC2 instance</link>.
+             </para>
+          </section>
+          
    </section>
    
  </chapter>