HBASE-8592 [documentation] some updates for the reference guide regarding recent questions on the ML

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1485067 13f79535-47bb-0310-9956-ffa450edef68
Michael Stack 2013-05-22 05:37:08 +00:00
parent 3b28043ade
commit b1ab14d7b3
5 changed files with 66 additions and 3 deletions


@ -65,8 +65,10 @@ to ensure well-formedness of your document after an edit session.
<section xml:id="java">
<title>Java</title>
<para>Just like Hadoop, HBase requires at least java 6 from
<link xlink:href="http://www.java.com/download/">Oracle</link>.</para>
<para>Just like Hadoop, HBase requires at least Java 6 from
<link xlink:href="http://www.java.com/download/">Oracle</link>.
Java 7 should work and can even be faster than Java 6, but almost all testing
has been done on the latter at this point.</para>
</section>
<section xml:id="os">
@ -701,6 +703,8 @@ stopping hbase...............</programlisting> Shutdown can take a moment to
</section>
<section xml:id="client_dependencies"><title>Client configuration and dependencies connecting to an HBase cluster</title>
<para>If you are running HBase in standalone mode, you don't need to configure anything for your clients to work
provided that they are all on the same machine.</para>
<para>
Since the HBase Master may move around, clients bootstrap by looking to ZooKeeper for
current critical locations. ZooKeeper is where all these values are kept. Thus clients


@ -490,6 +490,28 @@ false
</orderedlist>
</para>
</section>
<section xml:id="adding.new.node">
<title>Adding a New Node</title>
<para>Adding a new regionserver in HBase is essentially free; you simply start it like this:
<programlisting>$ ./bin/hbase-daemon.sh start regionserver</programlisting>
and it will register itself with the master. Ideally you also started a DataNode on the same
machine so that the RS can eventually start to have local files. If you rely on ssh to start your
daemons, don't forget to add the new hostname in <filename>conf/regionservers</filename> on the master.
</para>
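<para>If you do use the ssh-based scripts, appending the new hostname is enough; the hostname below is just a placeholder:
<programlisting>$ echo "new-rs.example.com" >> conf/regionservers</programlisting>
</para>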
<para>At this point the region server isn't serving data because no regions have moved to it yet. If the balancer is
enabled, it will start moving regions to the new RS. On a small/medium cluster this can have a very adverse effect
on latency as a lot of regions will be offline at the same time. It is thus recommended to disable the balancer
the same way it's done when decommissioning a node and move the regions manually (or even better, using a script
that moves them one by one).
</para>
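<para>A minimal sketch of doing that from the shell; the encoded region name and the target server name below are placeholders:
<programlisting>hbase(main):001:0> balance_switch false
hbase(main):002:0> move 'ENCODED_REGIONNAME', 'HOSTNAME,PORT,STARTCODE'</programlisting>
Remember to re-enable the balancer with <code>balance_switch true</code> once the regions have settled.</para>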
<para>The moved regions will all have 0% locality and won't have any blocks in cache so the region server will have
to use the network to serve requests. Apart from resulting in higher latency, it may also saturate
your network card's capacity. For practical purposes, consider that a standard 1GigE NIC won't be able to read
much more than <emphasis>100MB/s</emphasis>. In this case, or if you are in an OLAP environment and require
locality, it is recommended to major compact the moved regions.
</para>
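<para>A rough sketch of requesting that from the shell; major compaction can be asked for a whole table or for a single region, and the names below are just examples:
<programlisting>hbase(main):001:0> major_compact 'usertable'
hbase(main):002:0> major_compact 'REGIONNAME'</programlisting>
Because a major compaction rewrites the StoreFiles, the new files are written by the local region server and locality comes back on its own.</para>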
</section>
</section> <!-- node mgt -->
<section xml:id="hbase_metrics">


@ -684,7 +684,7 @@ the data will still be read.
<para>A fairly common question on the dist-list is why HBase isn't as performant as HDFS files in a batch context (e.g., as
a MapReduce source or sink). The short answer is that HBase is doing a lot more than HDFS (e.g., reading the KeyValues,
returning the most current row or specified timestamps, etc.), and as such HBase is 4-5 times slower than HDFS in this
processing context. Not that there isn't room for improvement (and this gap will, over time, be reduced), but HDFS
processing context. There is room for improvement and this gap will, over time, be reduced, but HDFS
will always be faster in this use-case.
</para>
</section>
@ -700,6 +700,24 @@ the data will still be read.
</para>
</section>
<section xml:id="perf.hbase.mr.cluster"><title>Collocating HBase and MapReduce</title>
<para>It is often recommended to have different clusters for HBase and MapReduce. A better qualification of this is:
don't collocate an HBase cluster that serves live requests with a heavy MR workload. OLTP and OLAP-optimized systems have
conflicting requirements and one will lose to the other, usually the former. For example, short latency-sensitive
disk reads will have to wait in line behind longer reads that are trying to squeeze out as much throughput as
possible. MR jobs that write to HBase will also generate flushes and compactions, which will in turn invalidate
blocks in the <xref linkend="block.cache"/>.
</para>
<para>If you need to process the data from your live HBase cluster in MR, you can ship the deltas with <xref linkend="copy.table"/>
or use replication to get the new data in real time on the OLAP cluster. In the worst case, if you really need to
collocate both, set MR to use fewer Map and Reduce slots than you'd normally configure, possibly just one.
</para>
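<para>As an illustration of shipping deltas, a <xref linkend="copy.table"/> run can be bounded by timestamps and pointed at the other cluster; the timestamps, ZooKeeper quorum, and table name below are made up for the example:
<programlisting>$ ./bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
    --starttime=1369000000000 --endtime=1369086400000 \
    --peer.adr=olap-zk1,olap-zk2,olap-zk3:2181:/hbase usertable</programlisting>
</para>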
<para>When HBase is used for OLAP operations, it's preferable to set it up in a hardened way, such as configuring a higher
ZooKeeper session timeout and giving more memory to the MemStores (the argument being that the Block Cache won't be used much
since the workloads are usually long scans).
</para>
</section>
<section xml:id="perf.casestudy"><title>Case Studies</title>
<para>For Performance and Troubleshooting Case Studies, see <xref linkend="casestudies"/>.
</para>


@ -180,6 +180,21 @@ byte[] sbDigest = Bytes.toBytes(sDigest);
System.out.println("md5 digest as string length: " + sbDigest.length); // returns 26
</programlisting>
</para>
<para>Unfortunately, using a binary representation of a type will make your data harder to read outside of your code. For example,
this is what you will see in the shell when you increment a value:
<programlisting>
hbase(main):001:0> incr 't', 'r', 'f:q', 1
COUNTER VALUE = 1
hbase(main):002:0> get 't', 'r'
COLUMN CELL
f:q timestamp=1369163040570, value=\x00\x00\x00\x00\x00\x00\x00\x01
1 row(s) in 0.0310 seconds
</programlisting>
The shell makes a best effort to print a string, and in this case it decided to just print the hex. The same will
happen to your row keys inside the region names. It can be okay if you know what's being stored, but it might also
be unreadable if arbitrary data can be put in the same cells. This is the main trade-off.
</para>
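<para>For counters specifically the shell does know how to decode the value; continuing the example above, the output should look roughly like this:
<programlisting>hbase(main):003:0> get_counter 't', 'r', 'f:q'
COUNTER VALUE = 1</programlisting>
For arbitrary binary values there is no such shortcut, hence the trade-off described above.</para>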
</section>
</section>


@ -740,6 +740,10 @@ Caused by: java.io.FileNotFoundException: File _partition.lst does not exist.
<para>See the <link xlink:href="http://hadoop.apache.org/common/docs/current/hdfs_user_guide.html">HDFS User Guide</link> for other non-shell diagnostic
utilities like <code>fsck</code>.
</para>
<section xml:id="trouble.namenode.0size.hlogs">
<title>Zero size HLogs with data in them</title>
<para>Problem: when getting a listing of all the files in a region server's .logs directory, one file has a size of 0 but it contains data.</para>
<para>Answer: It's an HDFS quirk. A file that's currently being written to will appear to have a size of 0, but once it's closed it will show its true size.</para>
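<para>A quick way to get such a listing, assuming the usual <filename>/hbase</filename> root directory; the region server directory name below is a placeholder:
<programlisting>$ hadoop fs -ls /hbase/.logs/REGIONSERVER_HOSTNAME,PORT,STARTCODE</programlisting>
</para>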
<section xml:id="trouble.namenode.uncompaction">
<title>Use Cases</title>
<para>Two common use-cases for querying HDFS for HBase objects are researching the degree of uncompaction of a table. If there are a large number of StoreFiles for each ColumnFamily it could