HBASE-4566 book.xml,ops_mgt.xml - KeyValue documentation

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1181091 13f79535-47bb-0310-9956-ffa450edef68
2011-10-10 17:41:53 +00:00 · 2011-10-10 17:41:53 +00:00 · 5364f98a32
parent f4d8833824
commit 5364f98a32
2 changed files with 64 additions and 2 deletions
--- a/src/docbkx/book.xml
+++ b/src/docbkx/book.xml
@ -312,7 +312,7 @@ public static class MyReducer extends TableReducer&lt;Text, IntWritable, Immutab
      <para>A good general introduction to the strengths and weaknesses of modelling on
          the various non-RDBMS datastores is Ian Varley's Master's thesis,
          <link xlink:href="http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf">No Relation: The Mixed Blessings of Non-Relational Databases</link>.
-          Recommended.
+          Recommended.  Also, read <xref linkend="keyvalue"/> for how HBase stores data internally.
      </para>
  <section xml:id="schema.creation">
  <title>
@ -400,7 +400,7 @@ admin.enableTable(table);
       </para>
       <para>Most of the time small inefficiencies don't matter all that much.  Unfortunately,
         this is a case where they do.  Whatever patterns are selected for ColumnFamilies, attributes, and rowkeys they could be repeated
-       several billion times in your data</para>
+       several billion times in your data.  See <xref linkend="keyvalue"/> for more information on how HBase stores data internally.</para>
       <section xml:id="keysize.cf"><title>Column Families</title>
         <para>Try to keep the ColumnFamily names as small as possible, preferably one character (e.g. "d" for data/default).
         </para> 
@ -1615,6 +1615,8 @@ scan.setFilter(filter);
              Schubert Zhang's blog post on <link xlink:href="http://cloudepr.blogspot.com/2009/09/hfile-block-indexed-file-format-to.html">HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs</link> makes for a thorough introduction to HBase's hfile.  Matteo Bertozzi has also put up a
              helpful description, <link xlink:href="http://th30z.blogspot.com/2011/02/hbase-io-hfile.html?spref=tw">HBase I/O: HFile</link>.
          </para>
          <para>For more information, see the <link xlink:href="http://hbase.apache.org/xref/org/apache/hadoop/hbase/io/hfile/HFile.html">HFile source code</link>.
          </para>
      </section>
      <section xml:id="hfile_tool">
@ -1631,6 +1633,40 @@ scan.setFilter(filter);
        tool.</para>
      </section>
      </section>
      <section xml:id="hfile.blocks">
        <title>Blocks</title>
        <para>StoreFiles are composed of blocks.  The blocksize is configured on a per-ColumnFamily basis.
        </para>
        <para>For more information, see the <link xlink:href="http://hbase.apache.org/xref/org/apache/hadoop/hbase/io/hfile/HFileBlock.html">HFileBlock source code</link>.
        </para>
      </section>
      <section xml:id="keyvalue">
        <title>KeyValue</title>
        <para>The KeyValue class is the heart of data storage in HBase.  KeyValue wraps a byte array and takes an offset and a length into the passed array
         that specify where to start interpreting the content as a KeyValue.
        </para>
        <para>The KeyValue format inside a byte array is:
           <itemizedlist>
             <listitem>keylength</listitem>
             <listitem>valuelength</listitem>
             <listitem>key</listitem>
             <listitem>value</listitem>
           </itemizedlist>
        </para>
        <para>The Key is further decomposed as:
           <itemizedlist>
             <listitem>rowlength</listitem>
             <listitem>row (i.e., the rowkey)</listitem>
             <listitem>columnfamilylength</listitem>
             <listitem>columnfamily</listitem>
             <listitem>columnqualifier</listitem>
             <listitem>timestamp</listitem>
             <listitem>keytype (e.g., Put, Delete)</listitem>
           </itemizedlist>
        </para>
        <para>For more information, see the <link xlink:href="http://hbase.apache.org/xref/org/apache/hadoop/hbase/KeyValue.html">KeyValue source code</link>.
        </para>
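        <para>The layout listed above can be sketched with plain <code>java.nio</code>.  This is an illustration, not the KeyValue class itself:
         the class name <code>KeyValueLayoutSketch</code> and its methods are hypothetical, and the field widths (4-byte keylength and valuelength,
         2-byte rowlength, 1-byte columnfamilylength, 8-byte timestamp, 1-byte keytype) and the Put type code are assumptions taken from a reading of
         the KeyValue source of this era.  Note that the columnqualifier has no length field of its own; its length is whatever remains of the key.
        </para>
<programlisting><![CDATA[
```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only: mimics the byte layout with java.nio, not the HBase KeyValue class.
final class KeyValueLayoutSketch {

    // Assumed key type code for a Put.
    static final byte TYPE_PUT = 4;

    /** Serializes one cell as: keylength, valuelength, key, value. */
    static byte[] encode(byte[] row, byte[] family, byte[] qualifier,
                         long timestamp, byte type, byte[] value) {
        // key = rowlength(2) + row + columnfamilylength(1) + columnfamily
        //     + columnqualifier + timestamp(8) + keytype(1)   (assumed widths)
        int keyLength = 2 + row.length + 1 + family.length + qualifier.length + 8 + 1;
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 + keyLength + value.length);
        buf.putInt(keyLength);
        buf.putInt(value.length);
        buf.putShort((short) row.length);
        buf.put(row);
        buf.put((byte) family.length);
        buf.put(family);
        buf.put(qualifier);
        buf.putLong(timestamp);
        buf.put(type);
        buf.put(value);
        return buf.array();
    }

    /** Reads the row out of an encoded cell that starts at the given offset. */
    static byte[] row(byte[] bytes, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int rowLength = buf.getShort(offset + 8);   // skip keylength and valuelength
        int rowStart = offset + 8 + 2;
        return Arrays.copyOfRange(bytes, rowStart, rowStart + rowLength);
    }

    /** Reads the qualifier; its length is derived from the key length. */
    static byte[] qualifier(byte[] bytes, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int keyLength = buf.getInt(offset);
        int rowLength = buf.getShort(offset + 8);
        int familyLengthAt = offset + 8 + 2 + rowLength;
        int familyLength = buf.get(familyLengthAt) & 0xff;
        int qualifierStart = familyLengthAt + 1 + familyLength;
        int qualifierLength = keyLength - (2 + rowLength + 1 + familyLength + 8 + 1);
        return Arrays.copyOfRange(bytes, qualifierStart, qualifierStart + qualifierLength);
    }

    private static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] kv = encode(bytes("row1"), bytes("d"), bytes("q"), 1L, TYPE_PUT, bytes("v"));
        System.out.println(kv.length + " bytes, row="
            + new String(row(kv, 0), StandardCharsets.UTF_8) + ", qualifier="
            + new String(qualifier(kv, 0), StandardCharsets.UTF_8));
    }
}
```
]]></programlisting>
        <para>Under these assumptions, a four-byte rowkey, a one-character ColumnFamily, a one-byte qualifier and a one-byte value
         occupy 27 bytes, of which 26 are key and length fields.
        </para>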
      </section>
      <section xml:id="compaction">
        <title>Compaction</title>
        <para>There are two types of compactions:  minor and major.  Minor compactions will usually pick up a couple of the smaller adjacent
--- a/src/docbkx/ops_mgt.xml
+++ b/src/docbkx/ops_mgt.xml
@ -301,6 +301,32 @@ false
      <para>Since the cluster is up, there is a risk that edits could be missed in the export process.
      </para>
    </section>
  </section>  <!--  backup -->
  <section xml:id="ops.capacity"><title>Capacity Planning</title>
    <section xml:id="ops.capacity.storage"><title>Storage</title>
      <para>A common question for HBase administrators is how to estimate how much storage an HBase cluster will require.
      There are several aspects to consider, the most important of which is what data you load into the cluster.  Start
      with a solid understanding of how HBase handles data internally (KeyValue).
      </para>
      <section xml:id="ops.capacity.storage.kv"><title>KeyValue</title>
        <para>HBase storage will be dominated by KeyValues.  See <xref linkend="keyvalue" /> and <xref linkend="keysize" /> for 
        how HBase stores data internally.  
        </para>
        <para>It is critical to understand that there is a KeyValue instance for every attribute stored in a row, and the 
        rowkey-length, ColumnFamily name-length and attribute lengths will drive the size of the database more than any other
        factor.
        </para>
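        <para>As a rough illustration of why these lengths matter, the following sketch sums the bytes of one KeyValue and multiplies by the number
        of attributes stored.  The class <code>KeyValueSizeEstimate</code> and its methods are hypothetical, and the 20 bytes of fixed overhead assume
        the field widths read from the KeyValue source (4-byte keylength and valuelength, 2-byte rowlength, 1-byte columnfamilylength, 8-byte timestamp,
        1-byte keytype).  The result covers KeyValue bytes only; it ignores compression, block indexes and any other file-level overhead.
        </para>
<programlisting><![CDATA[
```java
// Illustrative sketch only: a back-of-the-envelope estimate, not an HBase API.
final class KeyValueSizeEstimate {

    // Assumed fixed bytes per KeyValue:
    // keylength(4) + valuelength(4) + rowlength(2) + columnfamilylength(1) + timestamp(8) + keytype(1)
    static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1;

    /** Bytes for a single KeyValue: the rowkey and ColumnFamily name are repeated in every one. */
    static long bytesPerKeyValue(int rowKeyLen, int familyLen, int qualifierLen, int valueLen) {
        return (long) FIXED_OVERHEAD + rowKeyLen + familyLen + qualifierLen + valueLen;
    }

    /** KeyValue bytes for a table where every row holds the same number of same-sized attributes. */
    static long estimate(long rows, int attributesPerRow,
                         int rowKeyLen, int familyLen, int qualifierLen, int valueLen) {
        return rows * attributesPerRow
            * bytesPerKeyValue(rowKeyLen, familyLen, qualifierLen, valueLen);
    }

    public static void main(String[] args) {
        // Hypothetical workload: one billion rows, ten 8-byte attributes per row, 16-byte rowkeys.
        long shortNames = estimate(1_000_000_000L, 10, 16, 1, 8, 8);
        long longNames = estimate(1_000_000_000L, 10, 16, 12, 24, 8);
        System.out.println("short names: " + shortNames + " bytes");
        System.out.println("long names:  " + longNames + " bytes");
    }
}
```
]]></programlisting>
        <para>With a one-character ColumnFamily and 8-byte qualifiers, each KeyValue in this hypothetical workload takes 53 bytes to carry an 8-byte value,
        or 530,000,000,000 bytes across ten billion attributes; longer names increase the total in direct proportion.
        </para>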
      </section>
      <section xml:id="ops.capacity.storage.sf"><title>StoreFiles and Blocks</title>
        <para>KeyValue instances are aggregated into blocks, and the blocksize is configurable on a per-ColumnFamily basis.
        Blocks are aggregated into StoreFiles.  See <xref linkend="regions.arch" />.
        </para>
      </section>
      <section xml:id="ops.capacity.storage.hdfs"><title>HDFS Block Replication</title>
        <para>Because HBase runs on top of HDFS, factor HDFS block replication into storage calculations.
        </para>
      </section>
    </section>
  </section>
 </chapter>