From c264d3f7ea058055708113e283fca8d188a42edf Mon Sep 17 00:00:00 2001 From: Michael Stack Date: Wed, 16 Jan 2013 18:22:08 +0000 Subject: [PATCH] Add to troubleshooting a note on zk buffer size issue when lots of regions -- hbase-4246 git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1434087 13f79535-47bb-0310-9956-ffa450edef68 --- src/docbkx/troubleshooting.xml | 256 +++++++++++++++++---------------- 1 file changed, 131 insertions(+), 125 deletions(-) diff --git a/src/docbkx/troubleshooting.xml b/src/docbkx/troubleshooting.xml index 48d7210ef8e..5967b03a3d6 100644 --- a/src/docbkx/troubleshooting.xml +++ b/src/docbkx/troubleshooting.xml @@ -54,7 +54,7 @@ prolonged garbage collection pauses that last longer than the default ZooKeeper session timeout. For more information on GC pauses, see the 3 part blog post by Todd Lipcon - and above. + and above.
@@ -91,7 +91,7 @@ NameNode The NameNode log is on the NameNode server. The HBase Master is typically run on the NameNode server, and well as ZooKeeper. For smaller clusters the JobTracker is typically run on the NameNode server as well. -
+
DataNode Each DataNode server will have a DataNode log for HDFS, as well as a RegionServer log for HBase. @@ -105,32 +105,32 @@ insight on timings at the server. Once enabled, the amount of log spewed is voluminous. It is not recommended that you leave this logging on for more than short bursts of time. To enable RPC-level - logging, browse to the RegionServer UI and click on + logging, browse to the RegionServer UI and click on Log Level. Set the log level to DEBUG for the package org.apache.hadoop.ipc (Thats right, for hadoop.ipc, NOT, hbase.ipc). Then tail the RegionServers log. Analyze. To disable, set the logging level back to INFO level. -
- + +
JVM Garbage Collection Logs - HBase is memory intensive, and using the default GC you can see long pauses in all threads including the Juliet Pause aka "GC of Death". - To help debug this or confirm this is happening GC logging can be turned on in the Java virtual machine. + HBase is memory intensive, and using the default GC you can see long pauses in all threads including the Juliet Pause aka "GC of Death". + To help debug this or confirm this is happening GC logging can be turned on in the Java virtual machine. To enable, in hbase-env.sh add: - + export HBASE_OPTS="-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/home/hadoop/hbase/logs/gc-hbase.log" - Adjust the log directory to wherever you log. Note: The GC log does NOT roll automatically, so you'll have to keep an eye on it so it doesn't fill up the disk. + Adjust the log directory to wherever you log. Note: The GC log does NOT roll automatically, so you'll have to keep an eye on it so it doesn't fill up the disk. At this point you should see logs like so: -64898.952: [GC [1 CMS-initial-mark: 2811538K(3055704K)] 2812179K(3061272K), 0.0007360 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] +64898.952: [GC [1 CMS-initial-mark: 2811538K(3055704K)] 2812179K(3061272K), 0.0007360 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 64898.953: [CMS-concurrent-mark-start] -64898.971: [GC 64898.971: [ParNew: 5567K->576K(5568K), 0.0101110 secs] 2817105K->2812715K(3061272K), 0.0102200 secs] [Times: user=0.07 sys=0.00, real=0.01 secs] +64898.971: [GC 64898.971: [ParNew: 5567K->576K(5568K), 0.0101110 secs] 2817105K->2812715K(3061272K), 0.0102200 secs] [Times: user=0.07 sys=0.00, real=0.01 secs] @@ -139,20 +139,20 @@ export HBASE_OPTS="-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+ The third line indicates a "minor GC", which pauses the VM for 0.0101110 seconds - aka 10 milliseconds. It has reduced the "ParNew" from about 5.5m to 576k. Later on in this cycle we see: - -64901.445: [CMS-concurrent-mark: 1.542/2.492 secs] [Times: user=10.49 sys=0.33, real=2.49 secs] + +64901.445: [CMS-concurrent-mark: 1.542/2.492 secs] [Times: user=10.49 sys=0.33, real=2.49 secs] 64901.445: [CMS-concurrent-preclean-start] -64901.453: [GC 64901.453: [ParNew: 5505K->573K(5568K), 0.0062440 secs] 2868746K->2864292K(3061272K), 0.0063360 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] -64901.476: [GC 64901.476: [ParNew: 5563K->575K(5568K), 0.0072510 secs] 2869283K->2864837K(3061272K), 0.0073320 secs] [Times: user=0.05 sys=0.01, real=0.01 secs] -64901.500: [GC 64901.500: [ParNew: 5517K->573K(5568K), 0.0120390 secs] 2869780K->2865267K(3061272K), 0.0121150 secs] [Times: user=0.09 sys=0.00, real=0.01 secs] -64901.529: [GC 64901.529: [ParNew: 5507K->569K(5568K), 0.0086240 secs] 2870200K->2865742K(3061272K), 0.0087180 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] -64901.554: [GC 64901.555: [ParNew: 5516K->575K(5568K), 0.0107130 secs] 2870689K->2866291K(3061272K), 0.0107820 secs] [Times: user=0.06 sys=0.00, real=0.01 secs] -64901.578: [CMS-concurrent-preclean: 0.070/0.133 secs] [Times: user=0.48 sys=0.01, real=0.14 secs] +64901.453: [GC 64901.453: [ParNew: 5505K->573K(5568K), 0.0062440 secs] 2868746K->2864292K(3061272K), 0.0063360 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] +64901.476: [GC 64901.476: [ParNew: 5563K->575K(5568K), 0.0072510 secs] 2869283K->2864837K(3061272K), 0.0073320 secs] [Times: user=0.05 sys=0.01, real=0.01 secs] +64901.500: [GC 64901.500: [ParNew: 5517K->573K(5568K), 0.0120390 secs] 2869780K->2865267K(3061272K), 0.0121150 secs] [Times: user=0.09 sys=0.00, real=0.01 secs] +64901.529: [GC 64901.529: [ParNew: 5507K->569K(5568K), 0.0086240 secs] 2870200K->2865742K(3061272K), 0.0087180 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] +64901.554: [GC 64901.555: [ParNew: 5516K->575K(5568K), 0.0107130 secs] 2870689K->2866291K(3061272K), 0.0107820 secs] [Times: user=0.06 sys=0.00, real=0.01 secs] +64901.578: [CMS-concurrent-preclean: 0.070/0.133 secs] [Times: user=0.48 sys=0.01, real=0.14 secs] 64901.578: [CMS-concurrent-abortable-preclean-start] -64901.584: [GC 64901.584: [ParNew: 5504K->571K(5568K), 0.0087270 secs] 2871220K->2866830K(3061272K), 0.0088220 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] -64901.609: [GC 64901.609: [ParNew: 5512K->569K(5568K), 0.0063370 secs] 2871771K->2867322K(3061272K), 0.0064230 secs] [Times: user=0.06 sys=0.00, real=0.01 secs] -64901.615: [CMS-concurrent-abortable-preclean: 0.007/0.037 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] -64901.616: [GC[YG occupancy: 645 K (5568 K)]64901.616: [Rescan (parallel) , 0.0020210 secs]64901.618: [weak refs processing, 0.0027950 secs] [1 CMS-remark: 2866753K(3055704K)] 2867399K(3061272K), 0.0049380 secs] [Times: user=0.00 sys=0.01, real=0.01 secs] +64901.584: [GC 64901.584: [ParNew: 5504K->571K(5568K), 0.0087270 secs] 2871220K->2866830K(3061272K), 0.0088220 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] +64901.609: [GC 64901.609: [ParNew: 5512K->569K(5568K), 0.0063370 secs] 2871771K->2867322K(3061272K), 0.0064230 secs] [Times: user=0.06 sys=0.00, real=0.01 secs] +64901.615: [CMS-concurrent-abortable-preclean: 0.007/0.037 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] +64901.616: [GC[YG occupancy: 645 K (5568 K)]64901.616: [Rescan (parallel) , 0.0020210 secs]64901.618: [weak refs processing, 0.0027950 secs] [1 CMS-remark: 2866753K(3055704K)] 2867399K(3061272K), 0.0049380 secs] [Times: user=0.00 sys=0.01, real=0.01 secs] 64901.621: [CMS-concurrent-sweep-start] @@ -161,20 +161,20 @@ export HBASE_OPTS="-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+ There are a few more minor GCs, then there is a pause at the 2nd last line: - -64901.616: [GC[YG occupancy: 645 K (5568 K)]64901.616: [Rescan (parallel) , 0.0020210 secs]64901.618: [weak refs processing, 0.0027950 secs] [1 CMS-remark: 2866753K(3055704K)] 2867399K(3061272K), 0.0049380 secs] [Times: user=0.00 sys=0.01, real=0.01 secs] + +64901.616: [GC[YG occupancy: 645 K (5568 K)]64901.616: [Rescan (parallel) , 0.0020210 secs]64901.618: [weak refs processing, 0.0027950 secs] [1 CMS-remark: 2866753K(3055704K)] 2867399K(3061272K), 0.0049380 secs] [Times: user=0.00 sys=0.01, real=0.01 secs] - The pause here is 0.0049380 seconds (aka 4.9 milliseconds) to 'remark' the heap. + The pause here is 0.0049380 seconds (aka 4.9 milliseconds) to 'remark' the heap. At this point the sweep starts, and you can watch the heap size go down: -64901.637: [GC 64901.637: [ParNew: 5501K->569K(5568K), 0.0097350 secs] 2871958K->2867441K(3061272K), 0.0098370 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] +64901.637: [GC 64901.637: [ParNew: 5501K->569K(5568K), 0.0097350 secs] 2871958K->2867441K(3061272K), 0.0098370 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] ... lines removed ... -64904.936: [GC 64904.936: [ParNew: 5532K->568K(5568K), 0.0070720 secs] 1365024K->1360689K(3061272K), 0.0071930 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] -64904.953: [CMS-concurrent-sweep: 2.030/3.332 secs] [Times: user=9.57 sys=0.26, real=3.33 secs] +64904.936: [GC 64904.936: [ParNew: 5532K->568K(5568K), 0.0070720 secs] 1365024K->1360689K(3061272K), 0.0071930 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] +64904.953: [CMS-concurrent-sweep: 2.030/3.332 secs] [Times: user=9.57 sys=0.26, real=3.33 secs] At this point, the CMS sweep took 3.332 seconds, and heap went from about ~ 2.8 GB to 1.3 GB (approximate). @@ -186,14 +186,14 @@ export HBASE_OPTS="-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+ Add this to HBASE_OPTS: - + export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m <cms options from above> <gc logging options from above>" For more information on GC pauses, see the 3 part blog post by Todd Lipcon and above. - +
@@ -201,7 +201,7 @@ export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m <cms options from above
search-hadoop.com - search-hadoop.com indexes all the mailing lists and is great for historical searches. + search-hadoop.com indexes all the mailing lists and is great for historical searches. Search here first when you have an issue as its more than likely someone has already had your problem.
@@ -212,7 +212,7 @@ export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m <cms options from above is generally used for questions on released versions of Apache HBase. Before going to the mailing list, make sure your question has not already been answered by searching the mailing list archives first. Use . - Take some time crafting your questionSee Getting Answers; a quality question that includes all context and + Take some time crafting your questionSee Getting Answers; a quality question that includes all context and exhibits evidence the author has tried to find answers in the manual and out on lists is more likely to get a prompt response. @@ -236,7 +236,7 @@ export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m <cms options from above Master Web Interface The Master starts a web-interface on port 60010 by default. - The Master web UI lists created tables and their definition (e.g., ColumnFamilies, blocksize, etc.). Additionally, + The Master web UI lists created tables and their definition (e.g., ColumnFamilies, blocksize, etc.). Additionally, the available RegionServers in the cluster are listed along with selected high-level metrics (requests, number of regions, usedHeap, maxHeap). The Master web UI allows navigation to each RegionServer's web UI. @@ -263,13 +263,13 @@ export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m <cms options from above ls path [watch] set path data [version] delquota [-n|-b] path - quit + quit printwatches on|off create [-s] [-e] path data acl stat path [watch] - close + close ls2 path [watch] - history + history listquota path setAcl path acl getAcl path @@ -292,7 +292,7 @@ export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m <cms options from above
top - + top is probably one of the most important tool when first trying to see what’s running on a machine and how the resources are consumed. Here’s an example from production system: top - 14:46:59 up 39 days, 11:55, 1 user, load average: 3.75, 3.57, 3.84 @@ -300,10 +300,10 @@ Tasks: 309 total, 1 running, 308 sleeping, 0 stopped, 0 zombie Cpu(s): 4.5%us, 1.6%sy, 0.0%ni, 91.7%id, 1.4%wa, 0.1%hi, 0.6%si, 0.0%st Mem: 24414432k total, 24296956k used, 117476k free, 7196k buffers Swap: 16008732k total, 14348k used, 15994384k free, 11106908k cached - - PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND -15558 hadoop 18 -2 3292m 2.4g 3556 S 79 10.4 6523:52 java -13268 hadoop 18 -2 8967m 8.2g 4104 S 21 35.1 5170:30 java + + PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND +15558 hadoop 18 -2 3292m 2.4g 3556 S 79 10.4 6523:52 java +13268 hadoop 18 -2 8967m 8.2g 4104 S 21 35.1 5170:30 java 8895 hadoop 18 -2 1581m 497m 3420 S 11 2.1 4002:32 java … @@ -351,7 +351,7 @@ hadoop@sv4borg12:~$ jps hadoop@sv4borg12:~$ ps aux | grep HRegionServer hadoop 17789 155 35.2 9067824 8604364 ? S<l Mar04 9855:48 /usr/java/jdk1.6.0_14/bin/java -Xmx8000m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/export1/hadoop/logs/gc-hbase.log -Dcom.sun.management.jmxremote.port=10102 -Dcom.sun.management.jmxremote.authenticate=true -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.password.file=/home/hadoop/hbase/conf/jmxremote.password -Dcom.sun.management.jmxremote -Dhbase.log.dir=/export1/hadoop/logs -Dhbase.log.file=hbase-hadoop-regionserver-sv4borg12.log -Dhbase.home.dir=/home/hadoop/hbase -Dhbase.id.str=hadoop -Dhbase.root.logger=INFO,DRFA -Djava.library.path=/home/hadoop/hbase/lib/native/Linux-amd64-64 -classpath /home/hadoop/hbase/bin/../conf:[many jars]:/home/hadoop/hadoop/conf org.apache.hadoop.hbase.regionserver.HRegionServer start - +
@@ -371,7 +371,7 @@ hadoop 17789 155 35.2 9067824 8604364 ? S<l Mar04 9855:48 /usr/java/j at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:395) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:647) at java.lang.Thread.run(Thread.java:619) - + The MemStore flusher thread that is currently flushing to a file: "regionserver60020.cacheFlusher" daemon prio=10 tid=0x0000000040f4e000 nid=0x45eb in Object.wait() [0x00007f16b5b86000..0x00007f16b5b87af0] java.lang.Thread.State: WAITING (on object monitor) @@ -444,7 +444,7 @@ hadoop 17789 155 35.2 9067824 8604364 ? S<l Mar04 9855:48 /usr/java/j A thread that receives data from HDFS: - + "IPC Client (47) connection to sv4borg9/10.4.24.40:9000 from hadoop" daemon prio=10 tid=0x00007f16a02d0000 nid=0x4fa3 runnable [0x00007f16b517d000..0x00007f16b517dbf0] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) @@ -509,27 +509,27 @@ hadoop 17789 155 35.2 9067824 8604364 ? S<l Mar04 9855:48 /usr/java/j
clusterssh+top - - clusterssh+top, it’s like a poor man’s monitoring system and it can be quite useful when you have only a few machines as it’s very easy to setup. Starting clusterssh will give you one terminal per machine and another terminal in which whatever you type will be retyped in every window. This means that you can type “top” once and it will start it for all of your machines at the same time giving you full view of the current state of your cluster. You can also tail all the logs at the same time, edit files, etc. + + clusterssh+top, it’s like a poor man’s monitoring system and it can be quite useful when you have only a few machines as it’s very easy to setup. Starting clusterssh will give you one terminal per machine and another terminal in which whatever you type will be retyped in every window. This means that you can type “top” once and it will start it for all of your machines at the same time giving you full view of the current state of your cluster. You can also tail all the logs at the same time, edit files, etc.
- +
Client - For more information on the HBase client, see . + For more information on the HBase client, see .
ScannerTimeoutException or UnknownScannerException - This is thrown if the time between RPC calls from the client to RegionServer exceeds the scan timeout. + This is thrown if the time between RPC calls from the client to RegionServer exceeds the scan timeout. For example, if Scan.setCaching is set to 500, then there will be an RPC call to fetch the next batch of rows every 500 .next() calls on the ResultScanner because data is being transferred in blocks of 500 rows to the client. Reducing the setCaching value may be an option, but setting this value too low makes for inefficient processing on numbers of rows. See . -
+
<classname>LeaseException</classname> when calling <classname>Scanner.next</classname> @@ -546,27 +546,27 @@ Harsh J investigated the issue as part of the mailing list thread Shell or client application throws lots of scary exceptions during normal operation Since 0.20.0 the default log level for org.apache.hadoop.hbase.*is DEBUG. - On your clients, edit $HBASE_HOME/conf/log4j.properties and change this: log4j.logger.org.apache.hadoop.hbase=DEBUG to this: log4j.logger.org.apache.hadoop.hbase=INFO, or even log4j.logger.org.apache.hadoop.hbase=WARN. + On your clients, edit $HBASE_HOME/conf/log4j.properties and change this: log4j.logger.org.apache.hadoop.hbase=DEBUG to this: log4j.logger.org.apache.hadoop.hbase=INFO, or even log4j.logger.org.apache.hadoop.hbase=WARN. -
+
Long Client Pauses With Compression - This is a fairly frequent question on the Apache HBase dist-list. The scenario is that a client is typically inserting a lot of data into a + This is a fairly frequent question on the Apache HBase dist-list. The scenario is that a client is typically inserting a lot of data into a relatively un-optimized HBase cluster. Compression can exacerbate the pauses, although it is not the source of the problem. See on the pattern for pre-creating regions and confirm that the table isn't starting with a single region. - See for cluster configuration, particularly hbase.hstore.blockingStoreFiles, hbase.hregion.memstore.block.multiplier, + See for cluster configuration, particularly hbase.hstore.blockingStoreFiles, hbase.hregion.memstore.block.multiplier, MAX_FILESIZE (region size), and MEMSTORE_FLUSHSIZE. - A slightly longer explanation of why pauses can happen is as follows: Puts are sometimes blocked on the MemStores which are blocked by the flusher thread which is blocked because there are + A slightly longer explanation of why pauses can happen is as follows: Puts are sometimes blocked on the MemStores which are blocked by the flusher thread which is blocked because there are too many files to compact because the compactor is given too many small files to compact and has to compact the same data repeatedly. This situation can occur even with minor compactions. Compounding this situation, Apache HBase doesn't compress data in memory. Thus, the 64MB that lives in the MemStore could become a 6MB file after compression - which results in a smaller StoreFile. The upside is that more data is packed into the same region, but performance is achieved by being able to write larger files - which is why HBase waits until the flushize before writing a new StoreFile. And smaller StoreFiles - become targets for compaction. Without compression the files are much bigger and don't need as much compaction, however this is at the expense of I/O. + become targets for compaction. Without compression the files are much bigger and don't need as much compaction, however this is at the expense of I/O. For additional information, see this thread on Long client pauses with compression. - -
+ +
ZooKeeper Client Connection Errors Errors like this... @@ -588,11 +588,11 @@ Harsh J investigated the issue as part of the mailing list thread 11/07/05 11:26:45 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181 - ... are either due to ZooKeeper being down, or unreachable due to network issues. + ... are either due to ZooKeeper being down, or unreachable due to network issues. The utility may help investigate ZooKeeper issues. -
+
Client running out of memory though heap size seems to be stable (but the off-heap/direct heap keeps growing) @@ -607,15 +607,15 @@ it a bit hefty. You want to make this setting client-side only especially if y server-side off-heap cache since this feature depends on being able to use big direct buffers (You may have to keep separate client-side and server-side config dirs). -
+
Client Slowdown When Calling Admin Methods (flush, compact, etc.) This is a client issue fixed by HBASE-5073 in 0.90.6. -There was a ZooKeeper leak in the client and the client was getting pummeled by ZooKeeper events with each additional -invocation of the admin API. +There was a ZooKeeper leak in the client and the client was getting pummeled by ZooKeeper events with each additional +invocation of the admin API. -
+
Secure Client Cannot Connect ([Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]) @@ -637,16 +637,16 @@ Finally, depending on your Kerberos configuration, you may need to install the < You may also need to download the unlimited strength JCE policy files. Uncompress and extract the downloaded file, and install the policy jars into <java-home>/lib/security. -
+ - +
MapReduce
You Think You're On The Cluster, But You're Actually Local This following stacktrace happened using ImportTsv, but things like this - can happen on any job with a mis-configuration. + can happen on any job with a mis-configuration. WARN mapred.LocalJobRunner: job_local_0001 java.lang.IllegalArgumentException: Can't read partitions file @@ -671,17 +671,17 @@ Caused by: java.io.FileNotFoundException: File _partition.lst does not exist. LocalJobRunner means the job is running locally, not on the cluster. - See + See - http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#classpath for more + http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#classpath for more information on HBase MapReduce jobs and classpaths. - +
- +
NameNode - For more information on the NameNode, see . + For more information on the NameNode, see .
HDFS Utilization of Tables and Regions @@ -691,7 +691,7 @@ Caused by: java.io.FileNotFoundException: File _partition.lst does not exist. hadoop fs -du /hbase/myTable ...returns a list of the regions under the HBase table 'myTable' and their disk utilization. For more information on HDFS shell commands, see the HDFS FileSystem Shell documentation. -
+
Browsing HDFS for HBase Objects Somtimes it will be necessary to explore the HBase objects that exist on HDFS. These objects could include the WALs (Write Ahead Logs), tables, regions, StoreFiles, etc. @@ -709,30 +709,30 @@ Caused by: java.io.FileNotFoundException: File _partition.lst does not exist. The HDFS directory structure of HBase WAL is.. /hbase - /.logs + /.logs /<RegionServer> (RegionServers) /<HLog> (WAL HLog files for the RegionServer) - See the HDFS User Guide for other non-shell diagnostic - utilities like fsck. + See the HDFS User Guide for other non-shell diagnostic + utilities like fsck.
Use Cases - Two common use-cases for querying HDFS for HBase objects is research the degree of uncompaction of a table. If there are a large number of StoreFiles for each ColumnFamily it could + Two common use-cases for querying HDFS for HBase objects is research the degree of uncompaction of a table. If there are a large number of StoreFiles for each ColumnFamily it could indicate the need for a major compaction. Additionally, after a major compaction if the resulting StoreFile is "small" it could indicate the need for a reduction of ColumnFamilies for the table.
-
+ - +
Network
Network Spikes - If you are seeing periodic network spikes you might want to check the compactionQueues to see if major + If you are seeing periodic network spikes you might want to check the compactionQueues to see if major compactions are happening. See for more information on managing compactions. @@ -748,12 +748,12 @@ Caused by: java.io.FileNotFoundException: File _partition.lst does not exist. Are all the network interfaces functioning correctly? Are you sure? See the Troubleshooting Case Study in .
- +
- +
RegionServer - For more information on the RegionServers, see . + For more information on the RegionServers, see .
Startup Errors @@ -761,9 +761,9 @@ Caused by: java.io.FileNotFoundException: File _partition.lst does not exist. Master Starts, But RegionServers Do Not The Master believes the RegionServers have the IP of 127.0.0.1 - which is localhost and resolves to the master's own localhost. - The RegionServers are erroneously informing the Master that their IP addresses are 127.0.0.1. + The RegionServers are erroneously informing the Master that their IP addresses are 127.0.0.1. - Modify /etc/hosts on the region servers, from... + Modify /etc/hosts on the region servers, from... # Do not remove the following line, or various programs # that require network functionality will fail. @@ -779,7 +779,7 @@ Caused by: java.io.FileNotFoundException: File _partition.lst does not exist.
- +
Compression Link Errors @@ -793,8 +793,8 @@ java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path .. then there is a path issue with the compression libraries. See the Configuration section on LZO compression configuration. -
-
+ +
Runtime Errors @@ -807,7 +807,7 @@ java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path Adding -XX:+UseMembar to the HBase HBASE_OPTS in conf/hbase-env.sh may fix it. - Also, are you using ? These are discouraged because they can lock up the + Also, are you using ? These are discouraged because they can lock up the RegionServers if not managed properly.
@@ -816,7 +816,7 @@ java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path If you see log messages like this... -2010-09-13 01:24:17,336 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: +2010-09-13 01:24:17,336 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Disk-related IOException in BlockReceiver constructor. Cause is java.io.IOException: Too many open files at java.io.UnixFileSystem.createFileExclusively(Native Method) at java.io.File.createNewFile(File.java:883) @@ -847,7 +847,7 @@ Disk-related IOException in BlockReceiver constructor. Cause is java.io.IOExcept 2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept xxx ms, ten times longer than scheduled: 10000 2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept xxx ms, ten times longer than scheduled: 15000 -2009-02-24 10:01:36,472 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master for xxx milliseconds - retrying +2009-02-24 10:01:36,472 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master for xxx milliseconds - retrying ... or see full GC compactions then you may be experiencing full GC's. @@ -878,12 +878,12 @@ java.io.IOException: Session Expired at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589) at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945) -ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired +ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired The JVM is doing a long running garbage collecting which is pausing every threads (aka "stop the world"). Since the RegionServer's local ZooKeeper client cannot send heartbeats, the session times out. - By design, we shut down any node that isn't able to contact the ZooKeeper ensemble after getting a timeout so that it stops serving data that may already be assigned elsewhere. + By design, we shut down any node that isn't able to contact the ZooKeeper ensemble after getting a timeout so that it stops serving data that may already be assigned elsewhere. @@ -892,7 +892,7 @@ ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expi Make sure you are not CPU starving the RegionServer thread. For example, if you are running a MapReduce job using 6 CPU-intensive tasks on a machine with 4 cores, you are probably starving the RegionServer enough to create longer garbage collection pauses. Increase the ZooKeeper session timeout - If you wish to increase the session timeout, add the following to your hbase-site.xml to increase the timeout from the default of 60 seconds to 120 seconds. + If you wish to increase the session timeout, add the following to your hbase-site.xml to increase the timeout from the default of 60 seconds to 120 seconds. <property> <name>zookeeper.session.timeout</name> @@ -906,8 +906,8 @@ ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expi Be aware that setting a higher timeout means that the regions served by a failed RegionServer will take at least - that amount of time to be transfered to another RegionServer. For a production system serving live requests, we would instead - recommend setting it lower than 1 minute and over-provision your cluster in order the lower the memory load on each machines (hence having + that amount of time to be transfered to another RegionServer. For a production system serving live requests, we would instead + recommend setting it lower than 1 minute and over-provision your cluster in order the lower the memory load on each machines (hence having less garbage to collect per machine). @@ -948,35 +948,41 @@ ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expi - +
Shutdown Errors -
+ - +
Master - For more information on the Master, see . + For more information on the Master, see .
Startup Errors
Master says that you need to run the hbase migrations script Upon running that, the hbase migrations script says no files in root directory. - HBase expects the root directory to either not exist, or to have already been initialized by hbase running a previous time. If you create a new directory for HBase using Hadoop DFS, this error will occur. - Make sure the HBase root directory does not currently exist or has been initialized by a previous run of HBase. Sure fire solution is to just use Hadoop dfs to delete the HBase root and let HBase create and initialize the directory itself. - + HBase expects the root directory to either not exist, or to have already been initialized by hbase running a previous time. If you create a new directory for HBase using Hadoop DFS, this error will occur. + Make sure the HBase root directory does not currently exist or has been initialized by a previous run of HBase. Sure fire solution is to just use Hadoop dfs to delete the HBase root and let HBase create and initialize the directory itself. +
- -
+
+ Packet len6080218 is out of range! + If you have many regions on your cluster and you see an error + like that reported above in this sections title in your logs, see + HBASE-4246 Cluster with too many regions cannot withstand some master failover scenarios. +
+ +
Shutdown Errors -
+ - +
ZooKeeper @@ -985,28 +991,28 @@ ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expi
Could not find my address: xyz in list of ZooKeeper quorum servers A ZooKeeper server wasn't able to start, throws that error. xyz is the name of your server. - This is a name lookup problem. HBase tries to start a ZooKeeper server on some machine but that machine isn't able to find itself in the hbase.zookeeper.quorum configuration. - - Use the hostname presented in the error message instead of the value you used. If you have a DNS server, you can set hbase.zookeeper.dns.interface and hbase.zookeeper.dns.nameserver in hbase-site.xml to make sure it resolves to the correct FQDN. - + This is a name lookup problem. HBase tries to start a ZooKeeper server on some machine but that machine isn't able to find itself in the hbase.zookeeper.quorum configuration. + + Use the hostname presented in the error message instead of the value you used. If you have a DNS server, you can set hbase.zookeeper.dns.interface and hbase.zookeeper.dns.nameserver in hbase-site.xml to make sure it resolves to the correct FQDN. +
- -
+ +
ZooKeeper, The Cluster Canary ZooKeeper is the cluster's "canary in the mineshaft". It'll be the first to notice issues if any so making sure its happy is the short-cut to a humming cluster. - + See the ZooKeeper Operating Environment Troubleshooting page. It has suggestions and tools for checking disk and networking performance; i.e. the operating environment your ZooKeeper and HBase are running in. Additionally, the utility may help investigate ZooKeeper issues. -
+ - +
- Amazon EC2 + Amazon EC2
ZooKeeper does not seem to work on Amazon EC2 HBase does not start when deployed as Amazon EC2 instances. Exceptions like the below appear in the Master and/or RegionServer logs: @@ -1018,8 +1024,8 @@ ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expi java.net.ConnectException: Connection refused - Security group policy is blocking the ZooKeeper port on a public address. - Use the internal EC2 host names when configuring the ZooKeeper quorum peer list. + Security group policy is blocking the ZooKeeper port on a public address. + Use the internal EC2 host names when configuring the ZooKeeper quorum peer list.
@@ -1033,11 +1039,11 @@ ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expi See Andrew's answer here, up on the user list: Remote Java client connection into EC2 instance.
- +
- +
- HBase and Hadoop version issues + HBase and Hadoop version issues
<code>NoClassDefFoundError</code> when trying to run 0.90.x on hadoop-0.20.205.x (or hadoop-1.0.x) Apache HBase 0.90.x does not ship with hadoop-0.20.205.x, etc. To make it run, you need to replace the hadoop @@ -1069,7 +1075,7 @@ org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate w ... ...are you trying to talk to an Hadoop 2.0.x from an HBase that has an Hadoop 1.0.x client? Use the HBase built against Hadoop 2.0 or rebuild your HBase passing the -Dhadoop.profile=2.0 -attribute to Maven (See for more). +attribute to Maven (See for more).
@@ -1080,5 +1086,5 @@ attribute to Maven (See for more). For Performance and Troubleshooting Case Studies, see .
- +