hbase-4939 book.xml (architecture/faq), troubleshooting.xml (created resources section)
git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1209688 13f79535-47bb-0310-9956-ffa450edef68
parent b3d87416d2
commit 7df968e09a

book.xml
@@ -1200,6 +1200,63 @@ if (!b) {
<chapter xml:id="architecture">
<title>Architecture</title>
<section xml:id="arch.overview">
<title>Overview</title>
<section xml:id="arch.overview.nosql">
<title>NoSQL?</title>
<para>HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database isn't an RDBMS which
supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an
example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking,
HBase is really more a "Data Store" than a "Data Base" because it lacks many of the features you find in an RDBMS,
such as typed columns, secondary indexes, triggers, and advanced query languages.
</para>
<para>However, HBase has many features that support both linear and modular scaling. HBase clusters expand
by adding RegionServers that are hosted on commodity-class servers. If a cluster expands from 10 to 20
RegionServers, for example, it doubles in both storage and processing capacity.
An RDBMS can scale well, but only up to a point - specifically, the size of a single database server - and the best
performance requires specialized hardware and storage devices. HBase features of note are:
<itemizedlist>
<listitem>Strongly consistent reads/writes: HBase is not an "eventually consistent" DataStore. This
makes it very suitable for tasks such as high-speed counter aggregation.</listitem>
<listitem>Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are
automatically split and re-distributed as your data grows.</listitem>
<listitem>Automatic RegionServer failover</listitem>
<listitem>Hadoop/HDFS Integration: HBase supports HDFS out of the box as its distributed file system.</listitem>
<listitem>MapReduce: HBase supports massively parallelized processing via MapReduce, using HBase as both
source and sink.</listitem>
<listitem>Java Client API: HBase supports an easy-to-use Java API for programmatic access (see the sketch after this list).</listitem>
<listitem>Thrift/REST API: HBase also supports Thrift and REST for non-Java front-ends.</listitem>
<listitem>Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high-volume query optimization.</listitem>
<listitem>Operational Management: HBase provides built-in web pages for operational insight as well as JMX metrics.</listitem>
</itemizedlist>
</para>
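<para>To make the Java Client API and strong-consistency points above concrete, here is a minimal sketch; the table
name "mytable", the column family "cf", and the row and qualifier names are hypothetical placeholders, so adapt them
to your own schema.
</para>
<programlisting><![CDATA[
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
  public static void main(String[] args) throws Exception {
    // Picks up cluster settings from hbase-site.xml on the classpath.
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");   // "mytable" is a hypothetical table name

    // Write a cell to the (hypothetical) column family "cf".
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("qual"), Bytes.toBytes("value"));
    table.put(put);

    // Read it back; the write above is immediately visible (strongly consistent).
    Get get = new Get(Bytes.toBytes("row1"));
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("qual"));

    // Atomic counter increment, the kind of operation behind high-speed counter aggregation.
    table.incrementColumnValue(Bytes.toBytes("row1"), Bytes.toBytes("cf"), Bytes.toBytes("counter"), 1L);

    table.close();
  }
}
]]></programlisting>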
</section>

<section xml:id="arch.overview.when">
<title>When Should I Use HBase?</title>
<para>First, make sure you have enough data. HBase isn't suitable for every problem. If you have
hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few
thousand/million rows, then using a traditional RDBMS might be a better choice because
all of your data might wind up on a single node (or two) and the rest of the cluster may
be sitting idle.
</para>
<para>Second, make sure you have enough hardware. Even HDFS doesn't do well with anything less than
5 DataNodes (due to things such as HDFS block replication, which has a default of 3), plus a NameNode.
</para>
<para>HBase can run quite well stand-alone on a laptop - but this should be considered a development
configuration only.
</para>
</section>
<section xml:id="arch.overview.hbasehdfs">
<title>What Is The Difference Between HBase and Hadoop/HDFS?</title>
<para><link xlink:href="http://hadoop.apache.org/hdfs/">HDFS</link> is a distributed file system that is well suited for the storage of large files.
Its documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files.
HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables.
This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed "StoreFiles" that exist
on HDFS for high-speed lookups. See the <xref linkend="datamodel" /> and the rest of this chapter for more information on how HBase achieves its goals.
</para>
</section>
</section>

<section xml:id="arch.catalog">
<title>Catalog Tables</title>

@@ -2000,17 +2057,7 @@ hbase> describe 't1'</programlisting>
<qandaentry>
<question><para>When should I use HBase?</para></question>
<answer>
<para>
<para>See the <xref linkend="arch.overview" /> in the Architecture chapter.
Anybody can download and give HBase a spin, even on a laptop. The scope of this answer is when
would it be best to use HBase in a <emphasis>real</emphasis> deployment.
</para>
<para>First, make sure you have enough hardware. Even HDFS doesn't do well with anything less than
5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.
Second, make sure you have enough data. HBase isn't suitable for every problem. If you have
hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few
thousand/million rows, then using a traditional RDBMS might be a better choice due to the
fact that all of your data might wind up on a single node (or two) and the rest of the cluster may
be sitting idle.
</para>
</answer>
</qandaentry>

@@ -2031,17 +2078,6 @@ hbase> describe 't1'</programlisting>
</para>
</answer>
</qandaentry>
<qandaentry xml:id="faq.hdfs.hbase">
<question><para>How does HBase work on top of HDFS?</para></question>
<answer>
<para>
<link xlink:href="http://hadoop.apache.org/hdfs/">HDFS</link> is a distributed file system that is well suited for the storage of large files. It's documentation
states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files.
HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion.
See the <xref linkend="datamodel" /> and <xref linkend="architecture" /> sections for more information on how HBase achieves its goals.
</para>
</answer>
</qandaentry>
</qandadiv>
<qandadiv xml:id="faq.config"><title>Configuration</title>
<qandaentry xml:id="faq.config.started">

@@ -2109,6 +2145,16 @@ hbase> describe 't1'</programlisting>
</answer>
</qandaentry>
</qandadiv>
<qandadiv xml:id="faq.mapreduce"><title>MapReduce</title>
<qandaentry xml:id="faq.mapreduce.use">
<question><para>How can I use MapReduce with HBase?</para></question>
<answer>
<para>
See <xref linkend="mapreduce" />.
</para>
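<para>In short, an HBase table can serve as the source (and/or sink) of a MapReduce job via the classes in the
org.apache.hadoop.hbase.mapreduce package. The following is only a rough sketch: the table name "mytable" and the
class and job names are hypothetical placeholders, and HBase also ships a ready-made RowCounter tool that does this
for real.
</para>
<programlisting><![CDATA[
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SketchRowCounter {

  // Mapper that sees each row of the scanned table once and bumps a counter.
  public static class RowCountMapper extends TableMapper<ImmutableBytesWritable, Result> {
    public enum Counters { ROWS }
    @Override
    public void map(ImmutableBytesWritable row, Result value, Context context) {
      context.getCounter(Counters.ROWS).increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "sketch-row-counter");   // hypothetical job name
    job.setJarByClass(SketchRowCounter.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // rows fetched per RPC; tune for the workload
    scan.setCacheBlocks(false);  // full-table scans should not churn the block cache

    // Wire the (hypothetical) table "mytable" up as the MapReduce source.
    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, RowCountMapper.class,
        ImmutableBytesWritable.class, Result.class, job);

    job.setOutputFormatClass(NullOutputFormat.class);  // counters only, no file output
    job.setNumReduceTasks(0);                          // map-only job
    job.waitForCompletion(true);
  }
}
]]></programlisting>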
</answer>
</qandaentry>
</qandadiv>
<qandadiv><title>Performance and Troubleshooting</title>
<qandaentry>
<question><para>

troubleshooting.xml
@@ -196,6 +196,28 @@ export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m <cms options from above
</para>
</section>
</section>
<section xml:id="trouble.resources">
<title>Resources</title>
<section xml:id="trouble.resources.lists">
<title>Dist-Lists</title>
<para>Sign up for the <link xlink:href="http://hbase.apache.org/mail-lists.html">HBase Dist-Lists</link> and post a question. 'Dev' is aimed at the
community of developers actually building HBase and at discussion of features currently under development, and 'User' is generally used for questions about released
versions of HBase.
</para>
</section>
<section xml:id="trouble.resources.searchhadoop">
<title>search-hadoop.com</title>
<para>
<link xlink:href="http://search-hadoop.com">search-hadoop.com</link> indexes all the mailing lists and is great for historical searches.
</para>
</section>
<section xml:id="trouble.resources.jira">
<title>JIRA</title>
<para>
<link xlink:href="https://issues.apache.org/jira/browse/HBASE">JIRA</link> is also really helpful when looking for Hadoop/HBase-specific issues.
</para>
</section>
</section>
<section xml:id="trouble.tools">
<title>Tools</title>
<section xml:id="trouble.tools.builtin">

@@ -221,12 +243,6 @@ export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m <cms options from above
</section>
<section xml:id="trouble.tools.external">
<title>External Tools</title>
<section xml:id="trouble.tools.searchhadoop">
<title>search-hadoop.com</title>
<para>
<link xlink:href="http://search-hadoop.com">search-hadoop.com</link> indexes all the mailing lists and <link xlink:href="https://issues.apache.org/jira/browse/HBASE">JIRA</link>, it’s really helpful when looking for Hadoop/HBase-specific issues.
</para>
</section>
<section xml:id="trouble.tools.tail">
<title>tail</title>
<para>