HBASE-4598 book update (book.xml, perf.xml, trouble.xml)

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1184830 13f79535-47bb-0310-9956-ffa450edef68
Doug Meil 2011-10-16 14:16:31 +00:00
parent 8adbd7f62e
commit 6f65994f51
3 changed files with 63 additions and 14 deletions

@ -1316,7 +1316,7 @@ scan.setFilter(filter);
<section xml:id="master"><title>Master</title>
<para><code>HMaster</code> is the implementation of the Master Server. The Master server
is responsible for monitoring all RegionServer instances in the cluster, and is
the interface for all metadata changes.
the interface for all metadata changes. In a distributed cluster, the Master typically runs on the <xref linkend="arch.hdfs.nn" />.
</para>
<section xml:id="master.startup"><title>Startup Behavior</title>
<para>If run in a multi-Master environment, all Masters compete to run the cluster. If the active
@ -1353,6 +1353,7 @@ scan.setFilter(filter);
</section>
<section xml:id="regionserver.arch"><title>RegionServer</title>
<para><code>HRegionServer</code> is the RegionServer implementation. It is responsible for serving and managing regions.
In a distributed cluster, a RegionServer runs on a <xref linkend="arch.hdfs.dn" />.
</para>
<section xml:id="regionserver.arch.api"><title>Interface</title>
<para>The methods exposed by <code>HRegionInterface</code> contain both data-oriented and region-maintenance methods:
@ -1711,6 +1712,27 @@ scan.setFilter(filter);
</section> <!-- bloom -->
</section>
<section xml:id="arch.hdfs"><title>HDFS</title>
<para>Because HBase runs on HDFS (and each StoreFile is written as a file on HDFS),
it is important to understand the HDFS architecture,
especially how it stores files, handles failovers, and replicates blocks.
</para>
<para>See the Hadoop documentation on <link xlink:href="http://hadoop.apache.org/common/docs/current/hdfs_design.html">HDFS Architecture</link>
for more information.
</para>
<section xml:id="arch.hdfs.nn"><title>NameNode</title>
<para>The NameNode is responsible for maintaining the filesystem metadata. See the above HDFS Architecture link
for more information.
</para>
</section>
<section xml:id="arch.hdfs.dn"><title>DataNode</title>
<para>The DataNodes are responsible for storing HDFS blocks. See the above HDFS Architecture link
for more information.
</para>
</section>
</section>
</chapter> <!-- architecture -->
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="external_apis.xml" />
@ -1889,15 +1911,15 @@ hbase> describe 't1'</programlisting>
</answer>
</qandaentry>
</qandadiv>
<qandadiv xml:id="ec2"><title>EC2</title>
<qandadiv xml:id="ec2"><title>Amazon EC2</title>
<qandaentry>
<question><para>
Why doesn't my remote java connection into my ec2 cluster work?
I am running HBase on Amazon EC2 and...
</para></question>
<answer>
<para>
See Andrew's answer on the user list: <link xlink:href="http://search-hadoop.com/m/sPdqNFAwyg2">Remote Java client connection into EC2 instance</link>.
</para>
See the Troubleshooting <xref linkend="trouble.ec2" /> and Performance <xref linkend="perf.ec2" /> sections.
</para>
</answer>
</qandaentry>
</qandadiv>

@ -410,14 +410,34 @@ htable.close();</programlisting></para>
</section>
</section> <!-- deleting -->
<section xml:id="perf.ec2"><title>Amazon EC2</title>
<para>Performance questions are common on Amazon EC2 environments because it is a shared environment. You will
not see the same throughput as on a dedicated server. When running tests on EC2, run them several times for the same
reason (i.e., it's a shared environment and you don't know what else is happening on the server).
</para>
<para>If you are running on EC2 and post performance questions on the dist-list, please state this fact up front,
because EC2 issues are practically a separate class of performance issues.
<section xml:id="perf.hdfs"><title>HDFS</title>
<para>Because HBase runs on <xref linkend="arch.hdfs" />, it is important to understand how HDFS works and how it affects
HBase.
</para>
<section xml:id="perf.hdfs.curr"><title>Current Issues With Low-Latency Reads</title>
<para>The original use-case for HDFS was batch processing. As such, low-latency reads were historically not a priority.
With the increased adoption of HBase, this is changing, and several improvements are already in development.
See the
<link xlink:href="https://issues.apache.org/jira/browse/HDFS-1599">Umbrella Jira Ticket for HDFS Improvements for HBase</link>.
</para>
</section>
<section xml:id="perf.hdfs.comp"><title>Performance Comparisons of HBase vs. HDFS</title>
<para>A fairly common question on the dist-list is why HBase isn't as performant as HDFS files in a batch context (e.g., as
a MapReduce source or sink). The short answer is that HBase is doing a lot more than HDFS (e.g., reading the KeyValues,
returning the most current row or specified timestamps, etc.), and as such HBase is 4-5 times slower than HDFS in this
processing context. There is room for improvement, and this gap will be reduced over time, but HDFS
will always be faster in this use-case.
</para>
</section>
</section>
</para>
<section xml:id="perf.ec2"><title>Amazon EC2</title>
<para>Performance questions are common on Amazon EC2 environments because it is a shared environment. You will
not see the same throughput as on a dedicated server. When running tests on EC2, run them several times for the same
reason (i.e., it's a shared environment and you don't know what else is happening on the server).
</para>
<para>If you are running on EC2 and post performance questions on the dist-list, please state this fact up front,
because EC2 issues are practically a separate class of performance issues.
</para>
</section>
</chapter>
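
The advice above to run EC2 tests several times can be sketched roughly as follows. This is a minimal illustration, not part of the HBase book or API: the class name RepeatedRuns, its methods, and the placeholder workload are all hypothetical, and a real test would substitute an actual HBase read or write for the workload.

```java
import java.util.Arrays;

/** Hypothetical helper (not an HBase class): times a workload several times and summarizes the spread. */
public final class RepeatedRuns {

    /** Runs the workload the given number of times and returns each wall-clock duration in nanoseconds. */
    public static long[] timeRuns(Runnable workload, int runs) {
        long[] elapsed = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            workload.run();
            elapsed[i] = System.nanoTime() - start;
        }
        return elapsed;
    }

    /** Arithmetic mean of the samples (NaN for an empty array). */
    public static double mean(long[] samples) {
        return Arrays.stream(samples).average().orElse(Double.NaN);
    }

    /** Population standard deviation of the samples. */
    public static double stdDev(long[] samples) {
        double m = mean(samples);
        double sumSquares = 0.0;
        for (long s : samples) {
            double d = s - m;
            sumSquares += d * d;
        }
        return Math.sqrt(sumSquares / samples.length);
    }

    public static void main(String[] args) {
        // Placeholder workload (an assumption); replace with the operation under test.
        Runnable workload = () -> {
            long acc = 0;
            for (int i = 0; i < 1_000_000; i++) {
                acc += i;
            }
            if (acc == 42) {
                System.out.println(acc); // uses the result so the loop is not trivially dead code
            }
        };

        long[] elapsed = timeRuns(workload, 10);
        System.out.printf("mean=%.0f ns, stddev=%.0f ns, min=%d ns, max=%d ns%n",
                mean(elapsed), stdDev(elapsed),
                Arrays.stream(elapsed).min().getAsLong(),
                Arrays.stream(elapsed).max().getAsLong());
    }
}
```

A large standard deviation relative to the mean suggests that noise from the shared environment dominates the measurement, in which case more runs (or runs at different times) are needed before drawing conclusions.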

@ -793,6 +793,13 @@ ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expi
<para>Questions on HBase and Amazon EC2 come up frequently on the HBase dist-list. Search for old threads using <link xlink:href="http://search-hadoop.com/">Search Hadoop</link>.
</para>
</section>
<section xml:id="trouble.ec2.connection">
<title>Remote Java Connection into EC2 Cluster Not Working</title>
<para>
See Andrew's answer on the user list: <link xlink:href="http://search-hadoop.com/m/sPdqNFAwyg2">Remote Java client connection into EC2 instance</link>.
</para>
</section>
</section>
</chapter>