HBASE-4165 reorganized performance chapter

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1153898 13f79535-47bb-0310-9956-ffa450edef68
Doug Meil 2011-08-04 14:56:50 +00:00
parent c0acc54f57
commit 2af2f5dbd9
1 changed files with 65 additions and 39 deletions


@ -134,13 +134,8 @@
<para>See <xref linkend="number.of.cfs" />.</para>
</section>
<section xml:id="perf.one.region">
<title>Data Clumping</title>
<para>If all your data is being written to one region, then re-read the
section on processing <link linkend="timeseries">timeseries</link>
data.</para>
</section>
<section xml:id="perf.writing">
<title>Writing to HBase</title>
<section xml:id="perf.batch.loading">
<title>Batch Loading</title>
@ -148,6 +143,7 @@
<link xlink:href="http://hbase.apache.org/bulk-loads.html">Bulk Loads</link>.
Otherwise, pay attention to the below.
</para>
</section> <!-- batch loading -->
<section xml:id="precreate.regions">
<title>
@ -199,13 +195,9 @@ Deferred log flush can be configured on tables via <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html">HTableDescriptor</link>. The default value of <varname>hbase.regionserver.optionallogflushinterval</varname> is 1000ms.
</para>
</section>
</section> <!-- batch loading -->
<section>
<title>HBase Client</title>
<section xml:id="perf.hbase.client.autoflush">
<title>AutoFlush</title>
<title>HBase Client: AutoFlush</title>
<para>When performing a lot of Puts, make sure that setAutoFlush is set
to false on your <link
@ -218,6 +210,46 @@ Deferred log flush can be configured on tables via <link
Calling <methodname>close</methodname> on the <classname>HTable</classname>
instance will invoke <methodname>flushCommits</methodname>.</para>
</section>
<section xml:id="perf.hbase.client.putwal">
<title>HBase Client: Turn off WAL on Puts</title>
<para>A frequently discussed option for increasing throughput on <classname>Put</classname>s is to call <code>writeToWAL(false)</code>. Turning this off means
that the RegionServer will <emphasis>not</emphasis> write the <classname>Put</classname> to the Write Ahead Log,
only into the memstore, HOWEVER the consequence is that if there
is a RegionServer failure <emphasis>there will be data loss</emphasis>.
If <code>writeToWAL(false)</code> is used, do so with extreme caution. You may find in actuality that
it makes little difference if your load is well distributed across the cluster.
</para>
<para>In general, it is best to use WAL for Puts, and where loading throughput
is a concern to use <link linkend="perf.batch.loading">bulk loading</link> techniques instead.
</para>
</section>
<section xml:id="perf.hbase.client.regiongroup">
<title>HBase Client: Group Puts by RegionServer</title>
<para>In addition to using the writeBuffer, grouping <classname>Put</classname>s by RegionServer can reduce the number of client RPC calls per writeBuffer flush.
A utility class, <classname>HTableUtil</classname>, currently on TRUNK does this; those still on 0.90.x or earlier can either copy it or
implement their own version.
</para>
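<para>As a rough illustration of the grouping idea only, the following is a minimal sketch in plain Java. It is not the <classname>HTableUtil</classname> implementation: the class name <code>PutGrouping</code>, the method <code>groupByServer</code>, and the <code>serverOf</code> lookup are hypothetical names introduced here, and the sketch assumes the caller can already map each pending write to the name of the RegionServer that hosts it.</para>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of HBase: buckets pending writes by the
// server that hosts them, so that each bucket can be sent as one batch.
final class PutGrouping {
    private PutGrouping() {}

    // serverOf is an assumed lookup (pending write -> hosting server name).
    // In a real client it would be backed by region location information.
    static <P> Map<String, List<P>> groupByServer(
            List<P> puts, Function<? super P, String> serverOf) {
        // LinkedHashMap keeps servers in first-seen order; each list keeps
        // the original order of the writes destined for that server.
        Map<String, List<P>> grouped = new LinkedHashMap<>();
        for (P put : puts) {
            grouped.computeIfAbsent(serverOf.apply(put), s -> new ArrayList<>()).add(put);
        }
        return grouped;
    }
}
```

<para>Each resulting list could then be submitted as a single batch, so that one flush talks to each RegionServer once rather than interleaving calls across servers.</para>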
</section>
<section xml:id="perf.hbase.write.mr.reducer">
<title>MapReduce: Skip The Reducer</title>
<para>When writing a lot of data to an HBase table in a Mapper (e.g., with <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>),
skip the Reducer step whenever possible. When a Reducer step is used, all of the output (Puts) from the Mapper will get spooled to disk, then shuffled to other
Reducers that will most likely be off-node.
</para>
</section>
<section xml:id="perf.one.region">
<title>Anti-Pattern: One Hot Region</title>
<para>If all your data is being written to one region at a time, then re-read the
section on processing <link linkend="timeseries">timeseries</link> data.</para>
<para>Also, see <xref linkend="precreate.regions"/>, as well as <xref linkend="perf.configurations"/> </para>
</section>
</section> <!-- writing -->
<section xml:id="perf.reading">
<title>Reading from HBase</title>
<section xml:id="perf.hbase.client.caching">
<title>Scan Caching</title>
@ -286,18 +318,12 @@ htable.close();</programlisting></para>
and minimal network traffic to the client for a single row.
</para>
</section>
<section xml:id="perf.hbase.client.putwal">
<title>Turn off WAL on Puts</title>
<para>A frequently discussed option for increasing throughput on <classname>Put</classname>s is to call <code>writeToWAL(false)</code>. Turning this off means
that the RegionServer will <emphasis>not</emphasis> write the <classname>Put</classname> to the Write Ahead Log,
only into the memstore, HOWEVER the consequence is that if there
is a RegionServer failure <emphasis>there will be data loss</emphasis>.
If <code>writeToWAL(false)</code> is used, do so with extreme caution. You may find in actuality that
it makes little difference if your load is well distributed across the cluster.
</para>
<para>In general, it is best to use WAL for Puts, and where loading throughput
is a concern to use <link linkend="perf.batch.loading">bulk loading</link> techniques instead.
</para>
</section>
<section xml:id="perf.hbase.read.dist">
<title>Concurrency: Monitor Data Spread</title>
<para>When performing a high number of concurrent reads, monitor the data spread of the target tables. If the target table(s) are in
too few regions, then the reads will fall on only a few nodes. </para>
<para>See <xref linkend="precreate.regions"/>, as well as <xref linkend="perf.configurations"/> </para>
</section>
</section> <!-- reading -->
</chapter>