HBASE-4566 book.xml,ops_mgt.xml - KeyValue documentation

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1181091 13f79535-47bb-0310-9956-ffa450edef68
Doug Meil 2011-10-10 17:41:53 +00:00
parent f4d8833824
commit 5364f98a32
2 changed files with 64 additions and 2 deletions


@ -312,7 +312,7 @@ public static class MyReducer extends TableReducer<Text, IntWritable, Immutab
<para>A good general introduction to the strengths and weaknesses of modelling in
the various non-rdbms datastores is Ian Varley's Master's thesis,
<link xlink:href="http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf">No Relation: The Mixed Blessings of Non-Relational Databases</link>.
Recommended.
Recommended. Also, read <xref linkend="keyvalue"/> for how HBase stores data internally.
</para>
<section xml:id="schema.creation">
<title>
@ -400,7 +400,7 @@ admin.enableTable(table);
</para>
<para>Most of the time small inefficiencies don't matter all that much. Unfortunately,
this is a case where they do. Whatever patterns are selected for ColumnFamilies, attributes, and rowkeys they could be repeated
several billion times in your data</para>
several billion times in your data. See <xref linkend="keyvalue"/> for more information on how HBase stores data internally.</para>
<section xml:id="keysize.cf"><title>Column Families</title>
<para>Try to keep the ColumnFamily names as small as possible, preferably one character (e.g. "d" for data/default).
</para>
@ -1615,6 +1615,8 @@ scan.setFilter(filter);
Schubert Zhang's blog post on <link xlink:href="http://cloudepr.blogspot.com/2009/09/hfile-block-indexed-file-format-to.html">HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs</link> makes for a thorough introduction to HBase's HFile. Matteo Bertozzi has also put up a
helpful description, <link xlink:href="http://th30z.blogspot.com/2011/02/hbase-io-hfile.html?spref=tw">HBase I/O: HFile</link>.
</para>
<para>For more information, see the <link xlink:href="http://hbase.apache.org/xref/org/apache/hadoop/hbase/io/hfile/HFile.html">HFile source code</link>.
</para>
</section>
<section xml:id="hfile_tool">
@ -1631,6 +1633,40 @@ scan.setFilter(filter);
tool.</para>
</section>
</section>
<section xml:id="hfile.blocks">
<title>Blocks</title>
<para>StoreFiles are composed of blocks. The blocksize is configured on a per-ColumnFamily basis.
</para>
<para>For more information, see the <link xlink:href="http://hbase.apache.org/xref/org/apache/hadoop/hbase/io/hfile/HFileBlock.html">HFileBlock source code</link>.
</para>
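<para>As a sketch, the blocksize can be set per ColumnFamily when creating or altering a table in the HBase shell. The table and family names below are illustrative, and the table may need to be disabled before altering:
</para>
```
disable 'mytable'
alter 'mytable', {NAME => 'd', BLOCKSIZE => '65536'}
enable 'mytable'
```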
</section>
<section xml:id="keyvalue">
<title>KeyValue</title>
<para>The KeyValue class is the heart of data storage in HBase. A KeyValue wraps a byte array, along with offsets and lengths into that array that specify where to start interpreting the content as a KeyValue.
</para>
<para>The KeyValue format inside a byte array is:
<itemizedlist>
<listitem><para>keylength</para></listitem>
<listitem><para>valuelength</para></listitem>
<listitem><para>key</para></listitem>
<listitem><para>value</para></listitem>
</itemizedlist>
</para>
<para>The Key is further decomposed as:
<itemizedlist>
<listitem><para>rowlength</para></listitem>
<listitem><para>row (i.e., the rowkey)</para></listitem>
<listitem><para>columnfamilylength</para></listitem>
<listitem><para>columnfamily</para></listitem>
<listitem><para>columnqualifier</para></listitem>
<listitem><para>timestamp</para></listitem>
<listitem><para>keytype (e.g., Put, Delete)</para></listitem>
</itemizedlist>
</para>
<para>For more information, see the <link xlink:href="http://hbase.apache.org/xref/org/apache/hadoop/hbase/KeyValue.html">KeyValue source code</link>.
</para>
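<para>As a sketch of the arithmetic, and assuming the field widths used by the KeyValue implementation (4-byte keylength and 4-byte valuelength prefixes, 2-byte rowlength, 1-byte columnfamilylength, 8-byte timestamp, 1-byte keytype), the serialized size of a single KeyValue can be estimated as follows. The class and method names are illustrative, not an HBase API:
</para>
```java
public class KeyValueSizeSketch {
    // Estimated serialized size of one KeyValue, assuming:
    // 4-byte keylength + 4-byte valuelength prefix, and a key composed of
    // 2-byte rowlength + row + 1-byte familylength + family + qualifier
    // + 8-byte timestamp + 1-byte keytype.
    static long kvSize(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        long keyLen = 2 + rowLen + 1 + familyLen + qualifierLen + 8 + 1;
        return 4 + 4 + keyLen + valueLen;
    }

    public static void main(String[] args) {
        // row "myrow" (5 bytes), family "d" (1), qualifier "attr" (4), value "v" (1)
        System.out.println(kvSize(5, 1, 4, 1)); // prints 31
    }
}
```
<para>Note that under these assumptions even a one-byte value costs 30 bytes of key and length overhead, which is why short rowkeys, ColumnFamily names, and attribute names matter.</para>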
</section>
<section xml:id="compaction">
<title>Compaction</title>
<para>There are two types of compactions: minor and major. Minor compactions will usually pick up a couple of the smaller adjacent


@ -301,6 +301,32 @@ false
<para>Since the cluster is up, there is a risk that edits could be missed in the export process.
</para>
</section>
</section> <!-- backup -->
<section xml:id="ops.capacity"><title>Capacity Planning</title>
<section xml:id="ops.capacity.storage"><title>Storage</title>
<para>A common question for HBase administrators is estimating how much storage will be required for an HBase cluster.
There are several aspects to consider, the most important of which is what data will be loaded into the cluster. Start
with a solid understanding of how HBase handles data internally (KeyValue).
</para>
<section xml:id="ops.capacity.storage.kv"><title>KeyValue</title>
<para>HBase storage will be dominated by KeyValues. See <xref linkend="keyvalue" /> and <xref linkend="keysize" /> for
how HBase stores data internally.
</para>
<para>It is critical to understand that there is a KeyValue instance for every attribute stored in a row, and that the
rowkey length, ColumnFamily name length, and attribute lengths will drive the size of the database more than any other
factor.
</para>
</section>
<section xml:id="ops.capacity.storage.sf"><title>StoreFiles and Blocks</title>
<para>KeyValue instances are aggregated into blocks, and the blocksize is configurable on a per-ColumnFamily basis.
Blocks are aggregated into StoreFiles. See <xref linkend="regions.arch" />.
</para>
</section>
<section xml:id="ops.capacity.storage.hdfs"><title>HDFS Block Replication</title>
<para>Because HBase runs on top of HDFS, factor HDFS block replication into storage calculations.
</para>
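<para>Putting the pieces together, a hypothetical back-of-envelope estimate (the names and numbers below are illustrative assumptions, not an HBase API, and compression is ignored) multiplies the per-KeyValue size by the number of attributes per row, the number of rows, and the HDFS replication factor:
</para>
```java
public class StorageEstimateSketch {
    // Total bytes = per-KeyValue size x attributes per row x rows x HDFS replication.
    static long estimateBytes(long kvSize, long attributesPerRow, long rows, long replication) {
        return kvSize * attributesPerRow * rows * replication;
    }

    public static void main(String[] args) {
        // Assumed KeyValue field widths: 4+4 byte length prefixes, 2-byte rowlength,
        // 1-byte familylength, 8-byte timestamp, 1-byte keytype.
        int rowLen = 10, familyLen = 1, qualifierLen = 4, valueLen = 8; // illustrative sizes
        long kvSize = 4 + 4 + (2 + rowLen + 1 + familyLen + qualifierLen + 8 + 1) + valueLen; // 43
        long total = estimateBytes(kvSize, 10, 1_000_000_000L, 3); // 3 = common HDFS default
        System.out.println(total); // 1290000000000, i.e. roughly 1.3 TB before compression
    }
}
```
<para>Small per-cell savings in rowkey and ColumnFamily name length are multiplied billions of times in an estimate like this, which is the point of the keysize recommendations above.</para>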
</section>
</section>
</section>
</chapter>