HBASE-8592 [documentation] some updates for the reference guide regarding recent questions on the ML

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1485067 13f79535-47bb-0310-9956-ffa450edef68
Michael Stack 2013-05-22 05:37:08 +00:00
parent 3b28043ade
commit b1ab14d7b3
5 changed files with 66 additions and 3 deletions


@ -65,8 +65,10 @@ to ensure well-formedness of your document after an edit session.
<section xml:id="java">
<title>Java</title>
<para>Just like Hadoop, HBase requires at least java 6 from
<link xlink:href="http://www.java.com/download/">Oracle</link>.</para>
<para>Just like Hadoop, HBase requires at least Java 6 from
<link xlink:href="http://www.java.com/download/">Oracle</link>.
Java 7 should work and can even be faster than Java 6, but almost all testing
has been done on the latter at this point.</para>
</section>
<section xml:id="os">
@ -701,6 +703,8 @@ stopping hbase...............</programlisting> Shutdown can take a moment to
</section>
<section xml:id="client_dependencies"><title>Client configuration and dependencies connecting to an HBase cluster</title>
<para>If you are running HBase in standalone mode, you don't need to configure anything for your clients to work
provided that they are all on the same machine.</para>
<para>
Since the HBase Master may move around, clients bootstrap by looking to ZooKeeper for
current critical locations. ZooKeeper is where all these values are kept. Thus clients


@ -490,6 +490,28 @@ false
</orderedlist>
</para>
</section>
<section xml:id="adding.new.node">
<title>Adding a New Node</title>
<para>Adding a new regionserver in HBase is essentially free; you simply start it like this:
<programlisting>$ ./bin/hbase-daemon.sh start regionserver</programlisting>
and it will register itself with the master. Ideally you also started a DataNode on the same
machine so that the RS can eventually start to have local files. If you rely on ssh to start your
daemons, don't forget to add the new hostname in <filename>conf/regionservers</filename> on the master.
</para>
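<para>If you do use the ssh-based scripts, appending the new hostname is enough; the hostname below is just a placeholder:
<programlisting>$ echo "new-rs.example.com" >> conf/regionservers</programlisting>
</para>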
<para>At this point the region server isn't serving data because no regions have moved to it yet. If the balancer is
enabled, it will start moving regions to the new RS. On a small/medium cluster this can have a very adverse effect
on latency as a lot of regions will be offline at the same time. It is thus recommended to disable the balancer
the same way it's done when decommissioning a node and move the regions manually (or even better, using a script
that moves them one by one).
</para>
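<para>A minimal sketch of doing that from the shell; the encoded region name and the target server name below are placeholders:
<programlisting>hbase(main):001:0> balance_switch false
hbase(main):002:0> move 'ENCODED_REGIONNAME', 'HOSTNAME,PORT,STARTCODE'</programlisting>
Remember to re-enable the balancer with <code>balance_switch true</code> once the regions have settled.</para>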
<para>The moved regions will all have 0% locality and won't have any blocks in cache so the region server will have
to use the network to serve requests. Apart from resulting in higher latency, it may also saturate
your network card's capacity. For practical purposes, consider that a standard 1GigE NIC won't be able to read
much more than <emphasis>100MB/s</emphasis>. In this case, or if you are in an OLAP environment and require
locality, it is recommended to major compact the moved regions.
</para>
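<para>A rough sketch of requesting that from the shell; major compaction can be asked for a whole table or for a single region, and the names below are just examples:
<programlisting>hbase(main):001:0> major_compact 'usertable'
hbase(main):002:0> major_compact 'REGIONNAME'</programlisting>
Because a major compaction rewrites the StoreFiles, the new files are written by the local region server and locality comes back on its own.</para>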
</section>
</section> <!-- node mgt -->
<section xml:id="hbase_metrics">


@ -684,7 +684,7 @@ the data will still be read.
<para>A fairly common question on the dist-list is why HBase isn't as performant as HDFS files in a batch context (e.g., as
a MapReduce source or sink). The short answer is that HBase is doing a lot more than HDFS (e.g., reading the KeyValues,
returning the most current row or specified timestamps, etc.), and as such HBase is 4-5 times slower than HDFS in this
processing context. Not that there isn't room for improvement (and this gap will, over time, be reduced), but HDFS
processing context. There is room for improvement and this gap will, over time, be reduced, but HDFS
will always be faster in this use-case.
</para>
</section>
@ -700,6 +700,24 @@ the data will still be read.
</para>
</section>
<section xml:id="perf.hbase.mr.cluster"><title>Collocating HBase and MapReduce</title>
<para>It is often recommended to have different clusters for HBase and MapReduce. A better qualification of this is:
don't collocate an HBase cluster that serves live requests with a heavy MR workload. OLTP and OLAP-optimized systems have
conflicting requirements and one will lose to the other, usually the former. For example, short latency-sensitive
disk reads will have to wait in line behind longer reads that are trying to squeeze out as much throughput as
possible. MR jobs that write to HBase will also generate flushes and compactions, which will in turn invalidate
blocks in the <xref linkend="block.cache"/>.
</para>
<para>If you need to process the data from your live HBase cluster in MR, you can ship the deltas with <xref linkend="copy.table"/>
or use replication to get the new data in real time on the OLAP cluster. In the worst case, if you really need to
collocate both, set MR to use fewer Map and Reduce slots than you'd normally configure, possibly just one.
</para>
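<para>As an illustration of shipping deltas, a <xref linkend="copy.table"/> run can be bounded by timestamps and pointed at the other cluster; the timestamps, ZooKeeper quorum, and table name below are made up for the example:
<programlisting>$ ./bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
    --starttime=1369000000000 --endtime=1369086400000 \
    --peer.adr=olap-zk1,olap-zk2,olap-zk3:2181:/hbase usertable</programlisting>
</para>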
<para>When HBase is used for OLAP operations, it's preferable to set it up in a hardened way, such as configuring a higher
ZooKeeper session timeout and giving more memory to the MemStores (the argument being that the Block Cache won't be used much
since the workloads are usually long scans).
</para>
</section>
<section xml:id="perf.casestudy"><title>Case Studies</title>
<para>For Performance and Troubleshooting Case Studies, see <xref linkend="casestudies"/>.
</para>


@ -180,6 +180,21 @@ byte[] sbDigest = Bytes.toBytes(sDigest);
System.out.println("md5 digest as string length: " + sbDigest.length); // returns 26
</programlisting>
</para>
<para>Unfortunately, using a binary representation of a type will make your data harder to read outside of your code. For example,
this is what you will see in the shell when you increment a value:
<programlisting>
hbase(main):001:0> incr 't', 'r', 'f:q', 1
COUNTER VALUE = 1
hbase(main):002:0> get 't', 'r'
COLUMN CELL
f:q timestamp=1369163040570, value=\x00\x00\x00\x00\x00\x00\x00\x01
1 row(s) in 0.0310 seconds
</programlisting>
The shell makes a best effort to print a string, and in this case it decided to just print the hex. The same will
happen to your row keys inside the region names. It can be okay if you know what's being stored, but it might also
be unreadable if arbitrary data can be put in the same cells. This is the main trade-off.
</para>
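<para>For counters specifically the shell does know how to decode the value; continuing the example above, the output should look roughly like this:
<programlisting>hbase(main):003:0> get_counter 't', 'r', 'f:q'
COUNTER VALUE = 1</programlisting>
For arbitrary binary values there is no such shortcut, hence the trade-off described above.</para>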
</section>
</section>


@ -740,6 +740,10 @@ Caused by: java.io.FileNotFoundException: File _partition.lst does not exist.
<para>See the <link xlink:href="http://hadoop.apache.org/common/docs/current/hdfs_user_guide.html">HDFS User Guide</link> for other non-shell diagnostic
utilities like <code>fsck</code>.
</para>
<section xml:id="trouble.namenode.0size.hlogs">
<title>Zero size HLogs with data in them</title>
<para>Problem: when getting a listing of all the files in a region server's .logs directory, one file has a size of 0 but it contains data.</para>
<para>Answer: It's an HDFS quirk. A file that's currently being written to will appear to have a size of 0, but once it's closed it will show its true size.</para>
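<para>A quick way to get such a listing, assuming the usual <filename>/hbase</filename> root directory; the region server directory name below is a placeholder:
<programlisting>$ hadoop fs -ls /hbase/.logs/REGIONSERVER_HOSTNAME,PORT,STARTCODE</programlisting>
</para>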
<section xml:id="trouble.namenode.uncompaction">
<title>Use Cases</title>
<para>Two common use-cases for querying HDFS for HBase objects are researching the degree of uncompaction of a table. If there are a large number of StoreFiles for each ColumnFamily it could