HBASE-8592 [documentation] some updates for the reference guide regarding recent questions on the ML
git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1485067 13f79535-47bb-0310-9956-ffa450edef68
parent 3b28043ade
commit b1ab14d7b3
@@ -65,8 +65,10 @@ to ensure well-formedness of your document after an edit session.
 
 <section xml:id="java">
 <title>Java</title>
-<para>Just like Hadoop, HBase requires at least java 6 from
-<link xlink:href="http://www.java.com/download/">Oracle</link>.</para>
+<para>Just like Hadoop, HBase requires at least Java 6 from
+<link xlink:href="http://www.java.com/download/">Oracle</link>.
+Java 7 should work and can even be faster than Java 6, but almost all testing
+has been done on the latter at this point.</para>
 </section>
 
 <section xml:id="os">
@@ -701,6 +703,8 @@ stopping hbase...............</programlisting> Shutdown can take a moment to
 </section>
 
 <section xml:id="client_dependencies"><title>Client configuration and dependencies connecting to an HBase cluster</title>
+<para>If you are running HBase in standalone mode, you don't need to configure anything for your client to work
+provided that they are all on the same machine.</para>
 <para>
 Since the HBase Master may move around, clients bootstrap by looking to ZooKeeper for
 current critical locations. ZooKeeper is where all these values are kept. Thus clients
@@ -490,6 +490,28 @@ false
 </orderedlist>
 </para>
 </section>
+<section xml:id="adding.new.node">
+<title>Adding a New Node</title>
+<para>Adding a new regionserver in HBase is essentially free, you simply start it like this:
+<programlisting>$ ./bin/hbase-daemon.sh start regionserver</programlisting>
+and it will register itself with the master. Ideally you also started a DataNode on the same
+machine so that the RS can eventually start to have local files. If you rely on ssh to start your
+daemons, don't forget to add the new hostname in <filename>conf/regionservers</filename> on the master.
+</para>
+<para>At this point the region server isn't serving data because no regions have moved to it yet. If the balancer is
+enabled, it will start moving regions to the new RS. On a small/medium cluster this can have a very adverse effect
+on latency as a lot of regions will be offline at the same time. It is thus recommended to disable the balancer
+the same way it's done when decommissioning a node and move the regions manually (or even better, using a script
+that moves them one by one).
+</para>
+<para>The moved regions will all have 0% locality and won't have any blocks in cache so the region server will have
+to use the network to serve requests. Apart from resulting in higher latency, it may also be able to use all of
+your network card's capacity. For practical purposes, consider that a standard 1GigE NIC won't be able to read
+much more than <emphasis>100MB/s</emphasis>. In this case, or if you are in an OLAP environment and require having
+locality, then it is recommended to major compact the moved regions.
+</para>
+
+</section>
 </section> <!-- node mgt -->
 
 <section xml:id="hbase_metrics">
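To put the 100MB/s figure from the "Adding a New Node" text above in perspective, here is a minimal arithmetic sketch. The 10 GB region size is a hypothetical value chosen for illustration, not a number from the commit, and the class and method names are likewise made up for this example. It only estimates how long a region server would need to pull that much non-local data over a link capped at roughly 100MB/s.

```java
import java.util.Locale;

public class RemoteReadEstimate {
    // Seconds needed to stream `bytes` at `megabytesPerSecond`
    // (1 MB is taken as 1024 * 1024 bytes here).
    static double secondsToRead(long bytes, double megabytesPerSecond) {
        return bytes / (megabytesPerSecond * 1024 * 1024);
    }

    public static void main(String[] args) {
        long regionBytes = 10L * 1024 * 1024 * 1024; // hypothetical 10 GB of region data
        double nicMBps = 100.0;                      // practical 1GigE ceiling quoted in the text
        // Locale.ROOT keeps the decimal separator stable across machines.
        System.out.println(String.format(Locale.ROOT, "%.1f s", secondsToRead(regionBytes, nicMBps)));
    }
}
```

Under these assumptions a single full pass over the region's data takes well over a minute of saturated network, which is the kind of cost the text suggests avoiding by major compacting the moved regions so that their files become local.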
@@ -684,7 +684,7 @@ the data will still be read.
 <para>A fairly common question on the dist-list is why HBase isn't as performant as HDFS files in a batch context (e.g., as
 a MapReduce source or sink). The short answer is that HBase is doing a lot more than HDFS (e.g., reading the KeyValues,
 returning the most current row or specified timestamps, etc.), and as such HBase is 4-5 times slower than HDFS in this
-processing context. Not that there isn't room for improvement (and this gap will, over time, be reduced), but HDFS
+processing context. There is room for improvement and this gap will, over time, be reduced, but HDFS
 will always be faster in this use-case.
 </para>
 </section>
@@ -700,6 +700,24 @@ the data will still be read.
 </para>
 </section>
 
+<section xml:id="perf.hbase.mr.cluster"><title>Collocating HBase and MapReduce</title>
+<para>It is often recommended to have different clusters for HBase and MapReduce. A better qualification of this is:
+don't collocate an HBase that serves live requests with a heavy MR workload. OLTP and OLAP-optimized systems have
+conflicting requirements and one will lose to the other, usually the former. For example, short latency-sensitive
+disk reads will have to wait in line behind longer reads that are trying to squeeze out as much throughput as
+possible. MR jobs that write to HBase will also generate flushes and compactions, which will in turn invalidate
+blocks in the <xref linkend="block.cache"/>.
+</para>
+<para>If you need to process the data from your live HBase cluster in MR, you can ship the deltas with <xref linkend="copy.table"/>
+or use replication to get the new data in real time on the OLAP cluster. In the worst case, if you really need to
+collocate both, set MR to use fewer Map and Reduce slots than you'd normally configure, possibly just one.
+</para>
+<para>When HBase is used for OLAP operations, it's preferable to set it up in a hardened way like configuring the ZooKeeper session
+timeout higher and giving more memory to the MemStores (the argument being that the Block Cache won't be used much
+since the workloads are usually long scans).
+</para>
+</section>
+
 <section xml:id="perf.casestudy"><title>Case Studies</title>
 <para>For Performance and Troubleshooting Case Studies, see <xref linkend="casestudies"/>.
 </para>
@@ -180,6 +180,21 @@ byte[] sbDigest = Bytes.toBytes(sDigest);
 System.out.println("md5 digest as string length: " + sbDigest.length); // returns 26
 </programlisting>
 </para>
+<para>Unfortunately, using a binary representation of a type will make your data harder to read outside of your code. For example,
+this is what you will see in the shell when you increment a value:
+<programlisting>
+hbase(main):001:0> incr 't', 'r', 'f:q', 1
+COUNTER VALUE = 1
+
+hbase(main):002:0> get 't', 'r'
+COLUMN CELL
+f:q timestamp=1369163040570, value=\x00\x00\x00\x00\x00\x00\x00\x01
+1 row(s) in 0.0310 seconds
+</programlisting>
+The shell makes a best effort to print a string, and in this case it decided to just print the hex. The same will
+happen to your row keys inside the region names. It can be okay if you know what's being stored, but it might also
+be unreadable if arbitrary data can be put in the same cells. This is the main trade-off.
+</para>
 </section>
 
 </section>
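The shell output in the hunk above can be reproduced outside HBase with a small sketch. This is an illustration only: it uses the JDK's `java.nio.ByteBuffer` in place of HBase's own `Bytes` utility, and the `toEscaped` helper is a hypothetical stand-in for how the shell renders non-printable bytes, not the shell's actual code. It contrasts the 8-byte big-endian encoding of the long value 1 with the one-byte string "1".

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class CounterEncoding {
    // Hypothetical helper: printable ASCII is kept as-is, every other byte becomes \xNN.
    static String toEscaped(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int v = b & 0xFF;
            if (v >= 0x20 && v < 0x7F) {
                sb.append((char) v);
            } else {
                sb.append(String.format("\\x%02X", v));
            }
        }
        return sb.toString();
    }

    // Big-endian 8-byte encoding of a long (ByteBuffer's default byte order).
    static byte[] longToBytes(long value) {
        return ByteBuffer.allocate(8).putLong(value).array();
    }

    public static void main(String[] args) {
        byte[] binary = longToBytes(1L);                     // binary representation
        byte[] text = "1".getBytes(StandardCharsets.UTF_8);  // string representation
        System.out.println(toEscaped(binary)); // \x00\x00\x00\x00\x00\x00\x00\x01
        System.out.println(toEscaped(text));   // 1
    }
}
```

The binary form is fixed-width and sorts numerically for non-negative values, while the string form stays readable in tools that print raw bytes, which is the trade-off the added paragraph describes.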
@@ -740,6 +740,10 @@ Caused by: java.io.FileNotFoundException: File _partition.lst does not exist.
 <para>See the <link xlink:href="see http://hadoop.apache.org/common/docs/current/hdfs_user_guide.html">HDFS User Guide</link> for other non-shell diagnostic
 utilities like <code>fsck</code>.
 </para>
+<section xml:id="trouble.namenode.0size.hlogs">
+<title>Zero size HLogs with data in them</title>
+<para>Problem: when getting a listing of all the files in a region server's .logs directory, one file has a size of 0 but it contains data.</para>
+<para>Answer: It's an HDFS quirk. A file that's currently being written to will appear to have a size of 0 but once it's closed it will show its true size.</para>
 <section xml:id="trouble.namenode.uncompaction">
 <title>Use Cases</title>
 <para>Two common use-cases for querying HDFS for HBase objects is research the degree of uncompaction of a table. If there are a large number of StoreFiles for each ColumnFamily it could