Add note that can OOME if many regions and MSLAB on

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1406306 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
Michael Stack 2012-11-06 20:20:16 +00:00
parent 9b3e63dcb7
commit 1ed61755ad
1 changed files with 66 additions and 59 deletions


@ -47,7 +47,7 @@
<title>Network</title>
<para>
Perhaps the most important factor in avoiding network issues that degrade Hadoop and HBase performance is the switching hardware
that is used. Decisions made early in the scope of the project can cause major problems when you double or triple the size of your cluster (or more).
</para>
<para>
Important items to consider:
@ -59,15 +59,15 @@
</para>
<section xml:id="perf.network.1switch">
<title>Single Switch</title>
<para>The single most important factor in this configuration is that the switching capacity of the hardware is capable of
handling the traffic which can be generated by all systems connected to the switch. Some lower priced commodity hardware
can have a slower switching capacity than could be utilized by a full switch.
</para>
</section>
<section xml:id="perf.network.2switch">
<title>Multiple Switches</title>
<para>Multiple switches are a potential pitfall in the architecture. The most common configuration of lower priced hardware is a
simple 1Gbps uplink from one switch to another. This often overlooked pinch point can easily become a bottleneck for cluster communication.
Especially with MapReduce jobs that are both reading and writing a lot of data the communication across this uplink could be saturated.
</para>
<para>Mitigation of this issue is fairly simple and can be accomplished in multiple ways:
@ -85,10 +85,10 @@
<listitem>Poor switch capacity performance</listitem>
<listitem>Insufficient uplink to another rack</listitem>
</itemizedlist>
If the switches in your rack have appropriate switching capacity to handle all the hosts at full speed, the next most likely issue will be caused by homing
more of your cluster across racks. The easiest way to avoid issues when spanning multiple racks is to use port trunking to create a bonded uplink to other racks.
The downside of this method, however, is in the overhead of ports that could otherwise be used for hosts. For example, creating an 8Gbps port channel from rack
A to rack B uses 8 of your 24 ports for inter-rack communication, which is a poor ROI; using too few, however, can mean you're not getting the most out of your cluster.
</para>
<para>Using 10GbE links between racks will greatly increase performance, and assuming your switches support a 10GbE uplink or allow for an expansion card, this will let you
save your ports for machines as opposed to uplinks.
@ -128,7 +128,14 @@
slides for background and detail<footnote><para>Recent JVMs do better with
regard to fragmentation, so make sure you are running a recent release.
Read down in the message,
<link xlink:href="http://osdir.com/ml/hotspot-gc-use/2011-11/msg00002.html">Identifying concurrent mode failures caused by fragmentation</link>.</para></footnote>.
Be aware that when enabled, each MemStore instance will occupy at least
one MSLAB chunk of memory. If you have thousands of regions, or lots
of regions each with many column families, this MSLAB allocation
may account for a good portion of your heap and in
an extreme case cause you to OOME. In this case, disable MSLAB,
lower the amount of memory it uses, or host fewer regions per server.
</para>
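<para>As an illustration, MSLAB can be disabled cluster-wide in hbase-site.xml (the property name below reflects current releases; verify it against your version):</para>
<programlisting>
&lt;property&gt;
  &lt;name&gt;hbase.hregion.memstore.mslab.enabled&lt;/name&gt;
  &lt;value&gt;false&lt;/value&gt;
&lt;/property&gt;
</programlisting>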
<para>For more information about GC logs, see <xref linkend="trouble.log.gc" />.
</para>
</section>
@ -159,44 +166,44 @@
<section xml:id="perf.handlers">
<title><varname>hbase.regionserver.handler.count</varname></title>
<para>See <xref linkend="hbase.regionserver.handler.count"/>.
</para>
</section>
<section xml:id="perf.hfile.block.cache.size">
<title><varname>hfile.block.cache.size</varname></title>
<para>See <xref linkend="hfile.block.cache.size"/>.
A memory setting for the RegionServer process.
</para>
</section>
<section xml:id="perf.rs.memstore.upperlimit">
<title><varname>hbase.regionserver.global.memstore.upperLimit</varname></title>
<para>See <xref linkend="hbase.regionserver.global.memstore.upperLimit"/>.
This memory setting is often adjusted for the RegionServer process depending on needs.
</para>
</section>
<section xml:id="perf.rs.memstore.lowerlimit">
<title><varname>hbase.regionserver.global.memstore.lowerLimit</varname></title>
<para>See <xref linkend="hbase.regionserver.global.memstore.lowerLimit"/>.
This memory setting is often adjusted for the RegionServer process depending on needs.
</para>
</section>
<section xml:id="perf.hstore.blockingstorefiles">
<title><varname>hbase.hstore.blockingStoreFiles</varname></title>
<para>See <xref linkend="hbase.hstore.blockingStoreFiles"/>.
If there is blocking in the RegionServer logs, increasing this can help.
</para>
</section>
<section xml:id="perf.hregion.memstore.block.multiplier">
<title><varname>hbase.hregion.memstore.block.multiplier</varname></title>
<para>See <xref linkend="hbase.hregion.memstore.block.multiplier"/>.
If there is enough RAM, increasing this can help.
</para>
</section>
<section xml:id="hbase.regionserver.checksum.verify">
<title><varname>hbase.regionserver.checksum.verify</varname></title>
<para>Have HBase write the checksum into the datablock and save
having to do the checksum seek whenever you read. See the
release note on <link xlink:href="https://issues.apache.org/jira/browse/HBASE-5074">HBASE-5074 support checksums in HBase block cache</link>.
</para>
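<para>A sketch of enabling this in hbase-site.xml, assuming a release that includes HBASE-5074 (the property name matches this section's title):</para>
<programlisting>
&lt;property&gt;
  &lt;name&gt;hbase.regionserver.checksum.verify&lt;/name&gt;
  &lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;
</programlisting>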
</section>
@ -218,7 +225,7 @@ more discussion around short circuit reads.
</para>
<para>To enable "short circuit" reads, you must set two configurations.
First, the hdfs-site.xml needs to be amended. Set
the property <varname>dfs.block.local-path-access.user</varname>
to be the <emphasis>only</emphasis> user that can use the shortcut.
This has to be the user that started HBase. Then in hbase-site.xml,
set <varname>dfs.client.read.shortcircuit</varname> to be <varname>true</varname>
@ -241,19 +248,19 @@ the data will still be read.
</section>
<section xml:id="perf.schema">
<title>Schema Design</title>
<section xml:id="perf.number.of.cfs">
<title>Number of Column Families</title>
<para>See <xref linkend="number.of.cfs" />.</para>
</section>
<section xml:id="perf.schema.keys">
<title>Key and Attribute Lengths</title>
<para>See <xref linkend="keysize" />. See also <xref linkend="perf.compression.however" /> for
compression caveats.</para>
</section>
<section xml:id="schema.regionsize"><title>Table RegionSize</title>
<para>The regionsize can be set on a per-table basis via <code>setFileSize</code> on
<link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html">HTableDescriptor</link> in the
event where certain tables require different regionsizes than the configured default regionsize.
</para>
<para>See <xref linkend="perf.number.of.regions"/> for more information.
@ -269,23 +276,23 @@ the data will still be read.
on each insert. If <varname>ROWCOL</varname>, the hash of the row +
column family + column family qualifier will be added to the bloom on
each key insert.</para>
<para>See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link> and
<xref linkend="blooms"/> for more information or this answer up in quora,
<link xlink:href="http://www.quora.com/How-are-bloom-filters-used-in-HBase">How are bloom filters used in HBase?</link>.
</para>
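<para>As a sketch only (the method and enum names below are from the client API of this era; check your release), setting a bloom filter on a column family looks like:</para>
<programlisting>
// assumes org.apache.hadoop.hbase.HColumnDescriptor and
// org.apache.hadoop.hbase.regionserver.StoreFile are on the classpath
HColumnDescriptor cf = new HColumnDescriptor("colfam1");
cf.setBloomFilterType(StoreFile.BloomType.ROWCOL); // or BloomType.ROW
</programlisting>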
</section>
<section xml:id="schema.cf.blocksize"><title>ColumnFamily BlockSize</title>
<para>The blocksize can be configured for each ColumnFamily in a table, and this defaults to 64k. Larger cell values require larger blocksizes.
There is an inverse relationship between blocksize and the resulting StoreFile indexes (i.e., if the blocksize is doubled then the resulting
indexes should be roughly halved).
</para>
<para>See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link>
and <xref linkend="store"/> for more information.
</para>
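<para>For example, a minimal sketch of raising the blocksize for a ColumnFamily holding large cells (API names per HColumnDescriptor, linked above):</para>
<programlisting>
HColumnDescriptor cf = new HColumnDescriptor("colfam1");
cf.setBlocksize(128 * 1024); // 128k, up from the 64k default, for larger cell values
</programlisting>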
</section>
<section xml:id="cf.in.memory">
<title>In-Memory ColumnFamilies</title>
<para>ColumnFamilies can optionally be defined as in-memory. Data is still persisted to disk, just like any other ColumnFamily.
In-memory blocks have the highest priority in the <xref linkend="block.cache" />, but it is not a guarantee that the entire table
will be in memory.
</para>
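<para>A minimal sketch of marking a ColumnFamily in-memory via HColumnDescriptor:</para>
<programlisting>
HColumnDescriptor cf = new HColumnDescriptor("colfam1");
cf.setInMemory(true); // blocks from this family get highest block cache priority
</programlisting>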
@ -297,17 +304,17 @@ the data will still be read.
<para>Production systems should use compression with their ColumnFamily definitions. See <xref linkend="compression" /> for more information.
</para>
<section xml:id="perf.compression.however"><title>However...</title>
<para>Compression deflates data <emphasis>on disk</emphasis>. When it's in-memory (e.g., in the
MemStore) or on the wire (e.g., transferring between RegionServer and Client) it's inflated.
So while using ColumnFamily compression is a best practice, it's not going to completely eliminate
the impact of over-sized Keys, over-sized ColumnFamily names, or over-sized Column names.
</para>
<para>See <xref linkend="keysize" /> for schema design tips, and <xref linkend="keyvalue"/> for more information on how HBase stores data internally.
</para>
</section>
</section>
</section> <!-- perf schema -->
<section xml:id="perf.writing">
<title>Writing to HBase</title>
@ -335,7 +342,7 @@ throws IOException {
} catch (TableExistsException e) {
logger.info("table " + table.getNameAsString() + " already exists");
// the table already exists...
return false;
}
}
@ -360,7 +367,7 @@ public static byte[][] getHexSplits(String startKey, String endKey, int numRegio
Table Creation: Deferred Log Flush
</title>
<para>
The default behavior for Puts using the Write Ahead Log (WAL) is that <classname>HLog</classname> edits will be written immediately. If deferred log flush is used,
WAL edits are kept in memory until the flush period. The benefit is aggregated and asynchronous <classname>HLog</classname> writes, but the potential downside is that if
the RegionServer goes down the yet-to-be-flushed edits are lost. This is safer, however, than not using WAL at all with Puts.
</para>
@ -368,7 +375,7 @@ WAL edits are kept in memory until the flush period. The benefit is aggregated
Deferred log flush can be configured on tables via <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html">HTableDescriptor</link>. The default value of <varname>hbase.regionserver.optionallogflushinterval</varname> is 1000ms.
</para>
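<para>A minimal sketch of enabling it at table-creation time (assuming the HTableDescriptor API of this era):</para>
<programlisting>
HTableDescriptor desc = new HTableDescriptor("mytable");
desc.setDeferredLogFlush(true); // WAL edits buffered, flushed every optionallogflushinterval ms
</programlisting>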
</section>
<section xml:id="perf.hbase.client.autoflush">
<title>HBase Client: AutoFlush</title>
@ -394,25 +401,25 @@ Deferred log flush can be configured on tables via <link
it makes little difference if your load is well distributed across the cluster.
</para>
<para>In general, it is best to use WAL for Puts, and where loading throughput
is a concern to use <link linkend="perf.batch.loading">bulk loading</link> techniques instead.
</para>
</section>
<section xml:id="perf.hbase.client.regiongroup">
<title>HBase Client: Group Puts by RegionServer</title>
<para>In addition to using the writeBuffer, grouping <classname>Put</classname>s by RegionServer can reduce the number of client RPC calls per writeBuffer flush.
There is a utility <classname>HTableUtil</classname> currently on TRUNK that does this, but you can either copy that or implement your own version for
those still on 0.90.x or earlier.
</para>
</section>
<section xml:id="perf.hbase.write.mr.reducer">
<title>MapReduce: Skip The Reducer</title>
<para>When writing a lot of data to an HBase table from a MR job (e.g., with <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>), and specifically where Puts are being emitted
from the Mapper, skip the Reducer step. When a Reducer step is used, all of the output (Puts) from the Mapper will get spooled to disk, then sorted/shuffled to other
Reducers that will most likely be off-node. It's far more efficient to just write directly to HBase.
</para>
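<para>In job-setup terms this means configuring a map-only job, e.g. (a sketch using the standard Hadoop Job API):</para>
<programlisting>
job.setOutputFormatClass(TableOutputFormat.class);
job.setNumReduceTasks(0); // map-only: Puts go from the Mapper straight to HBase
</programlisting>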
<para>For summary jobs where HBase is used as a source and a sink, writes will be coming from the Reducer step (e.g., summarize values then write out result).
This is a different processing problem than the above case.
</para>
</section>
@ -421,16 +428,16 @@ Deferred log flush can be configured on tables via <link
<para>If all your data is being written to one region at a time, then re-read the
section on processing <link linkend="timeseries">timeseries</link> data.</para>
<para>Also, if you are pre-splitting regions and all your data is <emphasis>still</emphasis> winding up in a single region even though
your keys aren't monotonically increasing, confirm that your keyspace actually works with the split strategy. There are a
variety of reasons that regions may appear "well split" but won't work with your data. As
the HBase client communicates directly with the RegionServers, the region hosting a given row key can be obtained via
<link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#getRegionLocation%28byte[]%29">HTable.getRegionLocation</link>.
</para>
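<para>For example (a sketch; the table name and row key are placeholders):</para>
<programlisting>
HTable htable = new HTable(conf, "mytable");
HRegionLocation loc = htable.getRegionLocation(Bytes.toBytes("myrow"));
System.out.println(loc); // toString includes the hosting RegionServer
</programlisting>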
<para>See <xref linkend="precreate.regions"/>, as well as <xref linkend="perf.configurations"/> </para>
</section>
</section> <!-- writing -->
<section xml:id="perf.reading">
<title>Reading from HBase</title>
@ -452,7 +459,7 @@ Deferred log flush can be configured on tables via <link
<para>Scan settings in MapReduce jobs deserve special attention. Timeouts (e.g., UnknownScannerException)
can result in Map tasks if it takes longer to process a batch of records than the scanner timeout allows before the client goes back to the RegionServer for the
next set of data. This problem can occur because there is non-trivial processing occurring per row. If you process
rows quickly, set caching higher. If you process rows more slowly (e.g., lots of transformations per row, writes),
then set caching lower.
</para>
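<para>For example, a sketch of the relevant Scan settings (setCaching is the standard client call; the value shown is illustrative):</para>
<programlisting>
Scan scan = new Scan();
scan.setCaching(500);       // rows fetched per RPC; the default of 1 is too low for MR jobs
scan.setCacheBlocks(false); // generally recommended for full-table MR scans
</programlisting>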
<para>Timeouts can also happen in a non-MapReduce use case (i.e., single threaded HBase client doing a Scan), but the
@ -472,8 +479,8 @@ Deferred log flush can be configured on tables via <link
</section>
<section xml:id="perf.hbase.mr.input">
<title>MapReduce - Input Splits</title>
<para>For MapReduce jobs that use HBase tables as a source, if there is a pattern where the "slow" map tasks seem to
have the same Input Split (i.e., the RegionServer serving the data), see the
Troubleshooting Case Study in <xref linkend="casestudies.slownode"/>.
</para>
</section>
@ -522,9 +529,9 @@ htable.close();</programlisting></para>
</section>
<section xml:id="perf.hbase.read.dist">
<title>Concurrency: Monitor Data Spread</title>
<para>When performing a high number of concurrent reads, monitor the data spread of the target tables. If the target table(s) have
too few regions then the reads could likely be served from too few nodes. </para>
<para>See <xref linkend="precreate.regions"/>, as well as <xref linkend="perf.configurations"/> </para>
</section>
<section xml:id="blooms">
<title>Bloom Filters</title>
@ -554,7 +561,7 @@ htable.close();</programlisting></para>
</footnote></para>
<para>See also <xref linkend="schema.bloom" />.
</para>
<section xml:id="bloom_footprint">
<title>Bloom StoreFile footprint</title>
@ -584,7 +591,7 @@ htable.close();</programlisting></para>
data. Obtained on-demand. Stored in the LRU cache, if it is enabled
(it is enabled by default).</para>
</section>
</section>
<section xml:id="config.bloom">
<title>Bloom Filter Configuration</title>
<section>
@ -615,10 +622,10 @@ htable.close();</programlisting></para>
in HBase</link> for more on what this option means.</para>
</section>
</section>
</section> <!-- bloom -->
</section> <!-- reading -->
<section xml:id="perf.deleting">
<title>Deleting from HBase</title>
<section xml:id="perf.deleting.queue">
@ -647,20 +654,20 @@ htable.close();</programlisting></para>
<section xml:id="perf.hdfs.curr"><title>Current Issues With Low-Latency Reads</title>
<para>The original use-case for HDFS was batch processing. As such, low-latency reads were historically not a priority.
With the increased adoption of Apache HBase this is changing, and several improvements are already in development.
See the
<link xlink:href="https://issues.apache.org/jira/browse/HDFS-1599">Umbrella Jira Ticket for HDFS Improvements for HBase</link>.
</para>
</section>
<section xml:id="perf.hdfs.comp"><title>Performance Comparisons of HBase vs. HDFS</title>
<para>A fairly common question on the dist-list is why HBase isn't as performant as HDFS files in a batch context (e.g., as
a MapReduce source or sink). The short answer is that HBase is doing a lot more than HDFS (e.g., reading the KeyValues,
returning the most current row or specified timestamps, etc.), and as such HBase is 4-5 times slower than HDFS in this
processing context. Not that there isn't room for improvement (and this gap will, over time, be reduced), but HDFS
will always be faster in this use-case.
</para>
</section>
</section>
<section xml:id="perf.ec2"><title>Amazon EC2</title>
<para>Performance questions are common on Amazon EC2 environments because it is a shared environment. You will
not see the same throughput as a dedicated server. In terms of running tests on EC2, run them several times for the same
@ -670,7 +677,7 @@ htable.close();</programlisting></para>
because EC2 issues are practically a separate class of performance issues.
</para>
</section>
<section xml:id="perf.casestudy"><title>Case Studies</title>
<para>For Performance and Troubleshooting Case Studies, see <xref linkend="casestudies"/>.
</para>