diff --git a/src/docbkx/performance.xml b/src/docbkx/performance.xml index ba13ee04fbe..e46a4c9297d 100644 --- a/src/docbkx/performance.xml +++ b/src/docbkx/performance.xml @@ -47,7 +47,7 @@ Network Perhaps the most important factor in avoiding network issues degrading Hadoop and HBbase performance is the switching hardware - that is used, decisions made early in the scope of the project can cause major problems when you double or triple the size of your cluster (or more). + that is used, decisions made early in the scope of the project can cause major problems when you double or triple the size of your cluster (or more). Important items to consider: @@ -59,15 +59,15 @@
Single Switch - The single most important factor in this configuration is that the switching capacity of the hardware is capable of + The single most important factor in this configuration is that the switching capacity of the hardware is capable of handling the traffic which can be generated by all systems connected to the switch. Some lower priced commodity hardware - can have a slower switching capacity than could be utilized by a full switch. + can have a slower switching capacity than could be utilized by a full switch.
Multiple Switches Multiple switches are a potential pitfall in the architecture. The most common configuration of lower priced hardware is a - simple 1Gbps uplink from one switch to another. This often overlooked pinch point can easily become a bottleneck for cluster communication. + simple 1Gbps uplink from one switch to another. This often overlooked pinch point can easily become a bottleneck for cluster communication. Especially with MapReduce jobs that are both reading and writing a lot of data the communication across this uplink could be saturated. Mitigation of this issue is fairly simple and can be accomplished in multiple ways: @@ -85,10 +85,10 @@ Poor switch capacity performance Insufficient uplink to another rack - If the the switches in your rack have appropriate switching capacity to handle all the hosts at full speed, the next most likely issue will be caused by homing + If the the switches in your rack have appropriate switching capacity to handle all the hosts at full speed, the next most likely issue will be caused by homing more of your cluster across racks. The easiest way to avoid issues when spanning multiple racks is to use port trunking to create a bonded uplink to other racks. The downside of this method however, is in the overhead of ports that could potentially be used. An example of this is, creating an 8Gbps port channel from rack - A to rack B, using 8 of your 24 ports to communicate between racks gives you a poor ROI, using too few however can mean you're not getting the most out of your cluster. + A to rack B, using 8 of your 24 ports to communicate between racks gives you a poor ROI, using too few however can mean you're not getting the most out of your cluster. Using 10Gbe links between racks will greatly increase performance, and assuming your switches support a 10Gbe uplink or allow for an expansion card will allow you to save your ports for machines as opposed to uplinks. @@ -128,7 +128,14 @@ slides for background and detailThe latest jvms do better regards fragmentation so make sure you are running a recent release. Read down in the message, - Identifying concurrent mode failures caused by fragmentation.. + Identifying concurrent mode failures caused by fragmentation.. + Be aware that when enabled, each MemStore instance will occupy at least + an MSLAB instance of memory. If you have thousands of regions or lots + of regions each with many column families, this allocation of MSLAB + may be responsible for a good portion of your heap allocation and in + an extreme case cause you to OOME. Disable MSLAB in this case, or + lower the amount of memory it uses or float less regions per server. + For more information about GC logs, see .
@@ -159,44 +166,44 @@
<varname>hbase.regionserver.handler.count</varname> - See . + See .
<varname>hfile.block.cache.size</varname> - See . + See . A memory setting for the RegionServer process. -
+
<varname>hbase.regionserver.global.memstore.upperLimit</varname> - See . + See . This memory setting is often adjusted for the RegionServer process depending on needs. -
+
<varname>hbase.regionserver.global.memstore.lowerLimit</varname> - See . + See . This memory setting is often adjusted for the RegionServer process depending on needs.
<varname>hbase.hstore.blockingStoreFiles</varname> - See . + See . If there is blocking in the RegionServer logs, increasing this can help.
<varname>hbase.hregion.memstore.block.multiplier</varname> - See . - If there is enough RAM, increasing this can help. + See . + If there is enough RAM, increasing this can help.
<varname>hbase.regionserver.checksum.verify</varname> Have HBase write the checksum into the datablock and save having to do the checksum seek whenever you read. See the - release note on HBASE-5074 support checksums in HBase block cache. + release note on HBASE-5074 support checksums in HBase block cache.
@@ -218,7 +225,7 @@ more discussion around short circuit reads. To enable "short circuit" reads, you must set two configurations. First, the hdfs-site.xml needs to be amended. Set -the property dfs.block.local-path-access.user +the property dfs.block.local-path-access.user to be the only user that can use the shortcut. This has to be the user that started HBase. Then in hbase-site.xml, set dfs.client.read.shortcircuit to be true @@ -241,19 +248,19 @@ the data will still be read.
Schema Design - +
Number of Column Families See .
Key and Attribute Lengths - See . See also for + See . See also for compression caveats.
Table RegionSize The regionsize can be set on a per-table basis via setFileSize on - HTableDescriptor in the + HTableDescriptor in the event where certain tables require different regionsizes than the configured default regionsize. See for more information. @@ -269,23 +276,23 @@ the data will still be read. on each insert. If ROWCOL, the hash of the row + column family + column family qualifier will be added to the bloom on each key insert. - See HColumnDescriptor and + See HColumnDescriptor and for more information or this answer up in quora, How are bloom filters used in HBase?.
ColumnFamily BlockSize - The blocksize can be configured for each ColumnFamily in a table, and this defaults to 64k. Larger cell values require larger blocksizes. + The blocksize can be configured for each ColumnFamily in a table, and this defaults to 64k. Larger cell values require larger blocksizes. There is an inverse relationship between blocksize and the resulting StoreFile indexes (i.e., if the blocksize is doubled then the resulting indexes should be roughly halved). - See HColumnDescriptor + See HColumnDescriptor and for more information.
In-Memory ColumnFamilies - ColumnFamilies can optionally be defined as in-memory. Data is still persisted to disk, just like any other ColumnFamily. + ColumnFamilies can optionally be defined as in-memory. Data is still persisted to disk, just like any other ColumnFamily. In-memory blocks have the highest priority in the , but it is not a guarantee that the entire table will be in memory. @@ -297,17 +304,17 @@ the data will still be read. Production systems should use compression with their ColumnFamily definitions. See for more information.
However... - Compression deflates data on disk. When it's in-memory (e.g., in the + Compression deflates data on disk. When it's in-memory (e.g., in the MemStore) or on the wire (e.g., transferring between RegionServer and Client) it's inflated. So while using ColumnFamily compression is a best practice, but it's not going to completely eliminate - the impact of over-sized Keys, over-sized ColumnFamily names, or over-sized Column names. + the impact of over-sized Keys, over-sized ColumnFamily names, or over-sized Column names. See on for schema design tips, and for more information on HBase stores data internally. - +
- +
Writing to HBase @@ -335,7 +342,7 @@ throws IOException { } catch (TableExistsException e) { logger.info("table " + table.getNameAsString() + " already exists"); // the table already exists... - return false; + return false; } } @@ -360,7 +367,7 @@ public static byte[][] getHexSplits(String startKey, String endKey, int numRegio Table Creation: Deferred Log Flush -The default behavior for Puts using the Write Ahead Log (WAL) is that HLog edits will be written immediately. If deferred log flush is used, +The default behavior for Puts using the Write Ahead Log (WAL) is that HLog edits will be written immediately. If deferred log flush is used, WAL edits are kept in memory until the flush period. The benefit is aggregated and asynchronous HLog- writes, but the potential downside is that if the RegionServer goes down the yet-to-be-flushed edits are lost. This is safer, however, than not using WAL at all with Puts. @@ -368,7 +375,7 @@ WAL edits are kept in memory until the flush period. The benefit is aggregated Deferred log flush can be configured on tables via HTableDescriptor. The default value of hbase.regionserver.optionallogflushinterval is 1000ms. -
+
HBase Client: AutoFlush @@ -394,25 +401,25 @@ Deferred log flush can be configured on tables via In general, it is best to use WAL for Puts, and where loading throughput - is a concern to use bulk loading techniques instead. + is a concern to use bulk loading techniques instead.
HBase Client: Group Puts by RegionServer - In addition to using the writeBuffer, grouping Puts by RegionServer can reduce the number of client RPC calls per writeBuffer flush. + In addition to using the writeBuffer, grouping Puts by RegionServer can reduce the number of client RPC calls per writeBuffer flush. There is a utility HTableUtil currently on TRUNK that does this, but you can either copy that or implement your own verison for those still on 0.90.x or earlier. -
+
MapReduce: Skip The Reducer When writing a lot of data to an HBase table from a MR job (e.g., with TableOutputFormat), and specifically where Puts are being emitted - from the Mapper, skip the Reducer step. When a Reducer step is used, all of the output (Puts) from the Mapper will get spooled to disk, then sorted/shuffled to other - Reducers that will most likely be off-node. It's far more efficient to just write directly to HBase. + from the Mapper, skip the Reducer step. When a Reducer step is used, all of the output (Puts) from the Mapper will get spooled to disk, then sorted/shuffled to other + Reducers that will most likely be off-node. It's far more efficient to just write directly to HBase. - For summary jobs where HBase is used as a source and a sink, then writes will be coming from the Reducer step (e.g., summarize values then write out result). - This is a different processing problem than from the the above case. + For summary jobs where HBase is used as a source and a sink, then writes will be coming from the Reducer step (e.g., summarize values then write out result). + This is a different processing problem than from the the above case.
@@ -421,16 +428,16 @@ Deferred log flush can be configured on tables via If all your data is being written to one region at a time, then re-read the section on processing timeseries data.
Also, if you are pre-splitting regions and all your data is still winding up in a single region even though - your keys aren't monotonically increasing, confirm that your keyspace actually works with the split strategy. There are a + your keys aren't monotonically increasing, confirm that your keyspace actually works with the split strategy. There are a variety of reasons that regions may appear "well split" but won't work with your data. As - the HBase client communicates directly with the RegionServers, this can be obtained via + the HBase client communicates directly with the RegionServers, this can be obtained via HTable.getRegionLocation. - See , as well as + See , as well as - +
Reading from HBase @@ -452,7 +459,7 @@ Deferred log flush can be configured on tables via Scan settings in MapReduce jobs deserve special attention. Timeouts can result (e.g., UnknownScannerException) in Map tasks if it takes longer to process a batch of records before the client goes back to the RegionServer for the next set of data. This problem can occur because there is non-trivial processing occuring per row. If you process - rows quickly, set caching higher. If you process rows more slowly (e.g., lots of transformations per row, writes), + rows quickly, set caching higher. If you process rows more slowly (e.g., lots of transformations per row, writes), then set caching lower. Timeouts can also happen in a non-MapReduce use case (i.e., single threaded HBase client doing a Scan), but the @@ -472,8 +479,8 @@ Deferred log flush can be configured on tables via
MapReduce - Input Splits - For MapReduce jobs that use HBase tables as a source, if there a pattern where the "slow" map tasks seem to - have the same Input Split (i.e., the RegionServer serving the data), see the + For MapReduce jobs that use HBase tables as a source, if there a pattern where the "slow" map tasks seem to + have the same Input Split (i.e., the RegionServer serving the data), see the Troubleshooting Case Study in .
@@ -522,9 +529,9 @@ htable.close();
Concurrency: Monitor Data Spread - When performing a high number of concurrent reads, monitor the data spread of the target tables. If the target table(s) have + When performing a high number of concurrent reads, monitor the data spread of the target tables. If the target table(s) have too few regions then the reads could likely be served from too few nodes. - See , as well as + See , as well as
Bloom Filters @@ -554,7 +561,7 @@ htable.close(); See also . - +
Bloom StoreFile footprint @@ -584,7 +591,7 @@ htable.close(); data. Obtained on-demand. Stored in the LRU cache, if it is enabled (Its enabled by default).
-
+
Bloom Filter Configuration
@@ -615,10 +622,10 @@ htable.close(); in HBase for more on what this option means.
- - + + - +
Deleting from HBase
@@ -647,20 +654,20 @@ htable.close();
Current Issues With Low-Latency Reads The original use-case for HDFS was batch processing. As such, there low-latency reads were historically not a priority. With the increased adoption of Apache HBase this is changing, and several improvements are already in development. - See the + See the Umbrella Jira Ticket for HDFS Improvements for HBase.
Performance Comparisons of HBase vs. HDFS - A fairly common question on the dist-list is why HBase isn't as performant as HDFS files in a batch context (e.g., as - a MapReduce source or sink). The short answer is that HBase is doing a lot more than HDFS (e.g., reading the KeyValues, - returning the most current row or specified timestamps, etc.), and as such HBase is 4-5 times slower than HDFS in this + A fairly common question on the dist-list is why HBase isn't as performant as HDFS files in a batch context (e.g., as + a MapReduce source or sink). The short answer is that HBase is doing a lot more than HDFS (e.g., reading the KeyValues, + returning the most current row or specified timestamps, etc.), and as such HBase is 4-5 times slower than HDFS in this processing context. Not that there isn't room for improvement (and this gap will, over time, be reduced), but HDFS will always be faster in this use-case.
- +
Amazon EC2 Performance questions are common on Amazon EC2 environments because it is a shared environment. You will not see the same throughput as a dedicated server. In terms of running tests on EC2, run them several times for the same @@ -670,7 +677,7 @@ htable.close(); because EC2 issues are practically a separate class of performance issues.
- +
Case Studies For Performance and Troubleshooting Case Studies, see .