hbase-5404. book.xml, performance.xml - more info on compression and schema design

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1244649 13f79535-47bb-0310-9956-ffa450edef68
2012-02-15 19:08:05 +00:00 · 2012-02-15 19:08:05 +00:00 · 421c120f4a
parent 71682997f3
commit 421c120f4a
2 changed files with 16 additions and 3 deletions
--- a/src/docbkx/book.xml
+++ b/src/docbkx/book.xml
@@ -648,15 +648,17 @@ admin.enableTable(table);
       <para>Most of the time small inefficiencies don't matter all that much.  Unfortunately,
         this is a case where they do.  Whatever patterns are selected for ColumnFamilies, attributes, and rowkeys they could be repeated
       several billion times in your data. </para>
-       <para>See <xref linkend="keyvalue"/> for more information on HBase stores data internally.</para>
+       <para>See <xref linkend="keyvalue"/> for more information on how HBase stores data internally and why this is important.</para>
       <section xml:id="keysize.cf"><title>Column Families</title>
         <para>Try to keep the ColumnFamily names as small as possible, preferably one character (e.g. "d" for data/default).
         </para> 
+       <para>See <xref linkend="keyvalue"/> for more information on how HBase stores data internally and why this is important.</para>
       </section>
       <section xml:id="keysize.atttributes"><title>Attributes</title>
         <para>Although verbose attribute names (e.g., "myVeryImportantAttribute") are easier to read, prefer shorter attribute names (e.g., "via")
         to store in HBase.
         </para> 
+       <para>See <xref linkend="keyvalue"/> for more information on how HBase stores data internally and why this is important.</para>
       </section>
       <section xml:id="keysize.row"><title>Rowkey Length</title>
         <para>Keep them as short as is reasonable such that they can still be useful for required data access (e.g., Get vs. Scan). 
@@ -692,6 +694,7 @@ System.out.println("md5 digest as string length: " + sbDigest.length);    // ret
 </programlisting>               
         </para>
       </section>
+       
    </section>
    <section xml:id="reverse.timestamp"><title>Reverse Timestamps</title>
    <para>A common problem in database processing is quickly finding the most recent version of a value.  A technique using reverse timestamps
@@ -888,7 +891,7 @@ System.out.println("md5 digest as string length: " + sbDigest.length);    // ret
  </section>
  <section xml:id="schema.ops"><title>Operational and Performance Configuration Options</title>
    <para>See the Performance section <xref linkend="perf.schema"/> for more information on operational and performance
-    schema design options, such as Bloom Filters, Table-configured regionsizes, and blocksizes.
+    schema design options, such as Bloom Filters, Table-configured regionsizes, compression, and blocksizes.
    </para>
  </section>  

--- a/src/docbkx/performance.xml
+++ b/src/docbkx/performance.xml
@@ -198,7 +198,8 @@
    </section>
    <section xml:id="perf.schema.keys">
      <title>Key and Attribute Lengths</title>
-      <para>See <xref linkend="keysize" />.</para>
+      <para>See <xref linkend="keysize" />.  See also <xref linkend="perf.compression.however" /> for 
+      compression caveats.</para>
    </section>
    <section xml:id="schema.regionsize"><title>Table RegionSize</title>
    <para>The regionsize can be set on a per-table basis via <code>setFileSize</code> on
@@ -244,6 +245,15 @@
      <title>Compression</title>
      <para>Production systems should use compression with their ColumnFamily definitions.  See <xref linkend="compression" /> for more information.
      </para>
+      <section xml:id="perf.compression.however"><title>However...</title>
+         <para>Compression deflates data <emphasis>on disk</emphasis>.  When it's in-memory (e.g., in the 
+         MemStore) or on the wire (e.g., transferring between RegionServer and Client) it's inflated.
+         So while using ColumnFamily compression is a best practice, it's not going to completely eliminate
+         the impact of over-sized Keys, over-sized ColumnFamily names, or over-sized Column names. 
+         </para>
+         <para>See <xref linkend="keysize" /> for schema design tips, and <xref linkend="keyvalue"/> for more information on how HBase stores data internally.
+         </para> 
+      </section>
    </section>
  </section>  <!--  perf schema -->
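
Reviewer note (not part of the commit): the point of the new "However..." section can be made concrete with a little arithmetic. The sketch below is an illustration only. It assumes the classic KeyValue layout without tags (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type, plus the row, family, qualifier, and value bytes). The class name `KeyValueSizeSketch`, the method `cellSize`, and the sample row key "row1" are hypothetical and are not HBase APIs. It estimates the uncompressed size of one cell, which is roughly what is held in the MemStore and sent on the wire.

```java
import java.nio.charset.StandardCharsets;

public class KeyValueSizeSketch {
    // Fixed bytes per cell under the assumed layout:
    // key length (4) + value length (4) + row length (2)
    // + family length (1) + timestamp (8) + key type (1).
    static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1;

    // Rough uncompressed size of a single cell, in bytes.
    static long cellSize(String row, String family, String qualifier, int valueLength) {
        return FIXED_OVERHEAD
            + row.getBytes(StandardCharsets.UTF_8).length
            + family.getBytes(StandardCharsets.UTF_8).length
            + qualifier.getBytes(StandardCharsets.UTF_8).length
            + valueLength;
    }

    public static void main(String[] args) {
        // Same row and same 8-byte value; only the names differ.
        long verbose = cellSize("row1", "default", "myVeryImportantAttribute", 8);
        long compact = cellSize("row1", "d", "via", 8);
        System.out.println("verbose names: " + verbose + " bytes per cell");
        System.out.println("compact names: " + compact + " bytes per cell");
        System.out.println("difference:    " + (verbose - compact) + " bytes per cell");
    }
}
```

Because the family and qualifier are stored with every cell, the per-cell difference is multiplied by the number of cells in the table, and on-disk compression does not recover it in memory or on the wire.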