Add to troubleshooting a note on zk buffer size issue when lots of regions -- hbase-4246

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1434087 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
Michael Stack 2013-01-16 18:22:08 +00:00
parent b892e48c7e
commit c264d3f7ea
1 changed file with 131 additions and 125 deletions


@ -54,7 +54,7 @@
prolonged garbage collection pauses that last longer than the default ZooKeeper session timeout.
For more information on GC pauses, see the
<link xlink:href="http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/">3 part blog post</link> by Todd Lipcon
and <xref linkend="gcpause" /> above.
</para>
</section>
<section xml:id="trouble.log">
@ -91,7 +91,7 @@
<title>NameNode</title>
<para>The NameNode log is on the NameNode server. The HBase Master is typically run on the NameNode server, as well as ZooKeeper.</para>
<para>For smaller clusters the JobTracker is typically run on the NameNode server as well.</para>
</section>
<section xml:id="trouble.log.locations.datanode">
<title>DataNode</title>
<para>Each DataNode server will have a DataNode log for HDFS, as well as a RegionServer log for HBase.</para>
@ -105,32 +105,32 @@
insight on timings at the server. Once enabled, the amount of log
spewed is voluminous. It is not recommended that you leave this
logging on for more than short bursts of time. To enable RPC-level
logging, browse to the RegionServer UI and click on
<emphasis>Log Level</emphasis>. Set the log level to <varname>DEBUG</varname> for the package
<classname>org.apache.hadoop.ipc</classname> (That's right, for
<classname>hadoop.ipc</classname>, NOT <classname>hbase.ipc</classname>). Then tail the RegionServer's log. Analyze.</para>
<para>To disable, set the logging level back to <varname>INFO</varname> level.
</para>
</section>
</section>
<section xml:id="trouble.log.gc">
<title>JVM Garbage Collection Logs</title>
<para>HBase is memory intensive, and using the default GC you can see long pauses in all threads, including the <emphasis>Juliet Pause</emphasis>, aka "GC of Death".
To help debug this, or to confirm it is happening, GC logging can be turned on in the Java virtual machine.
</para>
<para>
To enable, in <filename>hbase-env.sh</filename> add:
<programlisting>
export HBASE_OPTS="-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/home/hadoop/hbase/logs/gc-hbase.log"
</programlisting>
Adjust the log directory to wherever you log. Note: the GC log does NOT roll automatically, so you'll have to keep an eye on it so it doesn't fill up the disk.
</para>
<para>
At this point you should see logs like so:
<programlisting>
64898.952: [GC [1 CMS-initial-mark: 2811538K(3055704K)] 2812179K(3061272K), 0.0007360 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
64898.953: [CMS-concurrent-mark-start]
64898.971: [GC 64898.971: [ParNew: 5567K->576K(5568K), 0.0101110 secs] 2817105K->2812715K(3061272K), 0.0102200 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
</programlisting>
</para>
<para>
@ -139,20 +139,20 @@ export HBASE_OPTS="-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+
<para>
The third line indicates a "minor GC", which pauses the VM for 0.0101110 seconds, about 10 milliseconds. It has reduced the "ParNew" (young generation) from about 5.5 MB to 576 KB.
Later on in this cycle we see:
<programlisting>
64901.445: [CMS-concurrent-mark: 1.542/2.492 secs] [Times: user=10.49 sys=0.33, real=2.49 secs]
64901.445: [CMS-concurrent-preclean-start]
64901.453: [GC 64901.453: [ParNew: 5505K->573K(5568K), 0.0062440 secs] 2868746K->2864292K(3061272K), 0.0063360 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
64901.476: [GC 64901.476: [ParNew: 5563K->575K(5568K), 0.0072510 secs] 2869283K->2864837K(3061272K), 0.0073320 secs] [Times: user=0.05 sys=0.01, real=0.01 secs]
64901.500: [GC 64901.500: [ParNew: 5517K->573K(5568K), 0.0120390 secs] 2869780K->2865267K(3061272K), 0.0121150 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]
64901.529: [GC 64901.529: [ParNew: 5507K->569K(5568K), 0.0086240 secs] 2870200K->2865742K(3061272K), 0.0087180 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
64901.554: [GC 64901.555: [ParNew: 5516K->575K(5568K), 0.0107130 secs] 2870689K->2866291K(3061272K), 0.0107820 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
64901.578: [CMS-concurrent-preclean: 0.070/0.133 secs] [Times: user=0.48 sys=0.01, real=0.14 secs]
64901.578: [CMS-concurrent-abortable-preclean-start]
64901.584: [GC 64901.584: [ParNew: 5504K->571K(5568K), 0.0087270 secs] 2871220K->2866830K(3061272K), 0.0088220 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
64901.609: [GC 64901.609: [ParNew: 5512K->569K(5568K), 0.0063370 secs] 2871771K->2867322K(3061272K), 0.0064230 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
64901.615: [CMS-concurrent-abortable-preclean: 0.007/0.037 secs] [Times: user=0.13 sys=0.00, real=0.03 secs]
64901.616: [GC[YG occupancy: 645 K (5568 K)]64901.616: [Rescan (parallel) , 0.0020210 secs]64901.618: [weak refs processing, 0.0027950 secs] [1 CMS-remark: 2866753K(3055704K)] 2867399K(3061272K), 0.0049380 secs] [Times: user=0.00 sys=0.01, real=0.01 secs]
64901.621: [CMS-concurrent-sweep-start]
</programlisting>
</para>
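The per-line arithmetic above (pause time in seconds, young generation shrinking from a few MB to a few hundred KB) can be checked mechanically. The following is a rough standalone sketch, not HBase code, that pulls those figures out of a ParNew line with a regular expression:

```python
import re

# Matches the "[ParNew: before->after(capacity), pause secs]" fragment of a
# minor-GC line in the CMS GC log format shown above.
PARNEW = re.compile(r"\[ParNew: (\d+)K->(\d+)K\((\d+)K\), ([\d.]+) secs\]")

def parse_parnew(line):
    """Return young-gen occupancy change and pause for a ParNew line, else None."""
    m = PARNEW.search(line)
    if m is None:
        return None  # e.g. a CMS phase marker line has no ParNew fragment
    before_k, after_k, cap_k, secs = m.groups()
    return {
        "young_before_kb": int(before_k),
        "young_after_kb": int(after_k),
        "young_capacity_kb": int(cap_k),
        "pause_ms": float(secs) * 1000,
    }

line = ("64898.971: [GC 64898.971: [ParNew: 5567K->576K(5568K), 0.0101110 secs] "
        "2817105K->2812715K(3061272K), 0.0102200 secs] "
        "[Times: user=0.07 sys=0.00, real=0.01 secs]")
stats = parse_parnew(line)
print(stats)  # pause_ms is about 10.1; young gen drops from ~5.5 MB to 576 KB
```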
@ -161,20 +161,20 @@ export HBASE_OPTS="-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+
</para>
<para>
There are a few more minor GCs, then there is a pause at the second-to-last line:
<programlisting>
64901.616: [GC[YG occupancy: 645 K (5568 K)]64901.616: [Rescan (parallel) , 0.0020210 secs]64901.618: [weak refs processing, 0.0027950 secs] [1 CMS-remark: 2866753K(3055704K)] 2867399K(3061272K), 0.0049380 secs] [Times: user=0.00 sys=0.01, real=0.01 secs]
</programlisting>
</para>
<para>
The pause here is 0.0049380 seconds (about 4.9 milliseconds) to 'remark' the heap.
</para>
<para>
At this point the sweep starts, and you can watch the heap size go down:
<programlisting>
64901.637: [GC 64901.637: [ParNew: 5501K->569K(5568K), 0.0097350 secs] 2871958K->2867441K(3061272K), 0.0098370 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
... lines removed ...
64904.936: [GC 64904.936: [ParNew: 5532K->568K(5568K), 0.0070720 secs] 1365024K->1360689K(3061272K), 0.0071930 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
64904.953: [CMS-concurrent-sweep: 2.030/3.332 secs] [Times: user=9.57 sys=0.26, real=3.33 secs]
</programlisting>
At this point, the CMS sweep took 3.332 seconds, and the heap went from about 2.8 GB to about 1.3 GB.
</para>
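The total-heap figures in these lines are "used-before->used-after(capacity)" in KB, so the sweep's effect can be confirmed with simple arithmetic. A quick standalone check (not HBase code), using the totals from the first and second-to-last lines above:

```python
# Heap totals taken from the GC log lines above, in KB:
before_sweep_kb = 2871958  # from the 64901.637 line, just after the sweep starts
after_sweep_kb = 1365024   # from the 64904.936 line, near the end of the sweep

def kb_to_gb(kb):
    """Convert KB to GB (binary units, as the JVM reports them)."""
    return kb / 1024 / 1024

print(f"{kb_to_gb(before_sweep_kb):.1f} GB -> {kb_to_gb(after_sweep_kb):.1f} GB")
# prints "2.7 GB -> 1.3 GB", matching the roughly 2.8 GB -> 1.3 GB in the text
```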
@ -186,14 +186,14 @@ export HBASE_OPTS="-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+
</para>
<para>
Add this to HBASE_OPTS:
<programlisting>
export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m &lt;cms options from above&gt; &lt;gc logging options from above&gt;"
</programlisting>
</para>
<para>
For more information on GC pauses, see the <link xlink:href="http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/">3 part blog post</link> by Todd Lipcon
and <xref linkend="gcpause" /> above.
</para>
</section>
</section>
<section xml:id="trouble.resources">
@ -201,7 +201,7 @@ export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m &lt;cms options from above
<section xml:id="trouble.resources.searchhadoop">
<title>search-hadoop.com</title>
<para>
<link xlink:href="http://search-hadoop.com">search-hadoop.com</link> indexes all the mailing lists and is great for historical searches.
Search here first when you have an issue, as it's more than likely someone has already had your problem.
</para>
</section>
@ -212,7 +212,7 @@ export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m &lt;cms options from above
is generally used for questions on released versions of Apache HBase. Before going to the mailing list, make sure your
question has not already been answered by searching the mailing list archives first. Use
<xref linkend="trouble.resources.searchhadoop" />.
Take some time crafting your question<footnote><para>See <link xlink:href="http://www.mikeash.com/getting_answers.html">Getting Answers</link></para></footnote>; a quality question that includes all context and
exhibits evidence the author has tried to find answers in the manual and out on lists
is more likely to get a prompt response.
</para>
@ -236,7 +236,7 @@ export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m &lt;cms options from above
<title>Master Web Interface</title>
<para>The Master starts a web-interface on port 60010 by default.
</para>
<para>The Master web UI lists created tables and their definition (e.g., ColumnFamilies, blocksize, etc.). Additionally,
the available RegionServers in the cluster are listed along with selected high-level metrics (requests, number of regions, usedHeap, maxHeap).
The Master web UI allows navigation to each RegionServer's web UI.
</para>
@ -263,13 +263,13 @@ export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m &lt;cms options from above
ls path [watch]
set path data [version]
delquota [-n|-b] path
quit
printwatches on|off
create [-s] [-e] path data acl
stat path [watch]
close
ls2 path [watch]
history
listquota path
setAcl path acl
getAcl path
@ -292,7 +292,7 @@ export HBASE_OPTS="-XX:NewSize=64m -XX:MaxNewSize=64m &lt;cms options from above
</section>
<section xml:id="trouble.tools.top">
<title>top</title>
<para>
<code>top</code> is probably one of the most important tools when first trying to see what's running on a machine and how resources are being consumed. Here's an example from a production system:
<programlisting>
top - 14:46:59 up 39 days, 11:55, 1 user, load average: 3.75, 3.57, 3.84
@ -300,10 +300,10 @@ Tasks: 309 total, 1 running, 308 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.5%us, 1.6%sy, 0.0%ni, 91.7%id, 1.4%wa, 0.1%hi, 0.6%si, 0.0%st
Mem: 24414432k total, 24296956k used, 117476k free, 7196k buffers
Swap: 16008732k total, 14348k used, 15994384k free, 11106908k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15558 hadoop 18 -2 3292m 2.4g 3556 S 79 10.4 6523:52 java
13268 hadoop 18 -2 8967m 8.2g 4104 S 21 35.1 5170:30 java
8895 hadoop 18 -2 1581m 497m 3420 S 11 2.1 4002:32 java
</programlisting>
@ -351,7 +351,7 @@ hadoop@sv4borg12:~$ jps
<programlisting>
hadoop@sv4borg12:~$ ps aux | grep HRegionServer
hadoop 17789 155 35.2 9067824 8604364 ? S&lt;l Mar04 9855:48 /usr/java/jdk1.6.0_14/bin/java -Xmx8000m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/export1/hadoop/logs/gc-hbase.log -Dcom.sun.management.jmxremote.port=10102 -Dcom.sun.management.jmxremote.authenticate=true -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.password.file=/home/hadoop/hbase/conf/jmxremote.password -Dcom.sun.management.jmxremote -Dhbase.log.dir=/export1/hadoop/logs -Dhbase.log.file=hbase-hadoop-regionserver-sv4borg12.log -Dhbase.home.dir=/home/hadoop/hbase -Dhbase.id.str=hadoop -Dhbase.root.logger=INFO,DRFA -Djava.library.path=/home/hadoop/hbase/lib/native/Linux-amd64-64 -classpath /home/hadoop/hbase/bin/../conf:[many jars]:/home/hadoop/hadoop/conf org.apache.hadoop.hbase.regionserver.HRegionServer start
</programlisting>
</para>
</section>
<section xml:id="trouble.tools.jstack">
@ -371,7 +371,7 @@ hadoop 17789 155 35.2 9067824 8604364 ? S&lt;l Mar04 9855:48 /usr/java/j
at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:395)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:647)
at java.lang.Thread.run(Thread.java:619)
The MemStore flusher thread that is currently flushing to a file:
"regionserver60020.cacheFlusher" daemon prio=10 tid=0x0000000040f4e000 nid=0x45eb in Object.wait() [0x00007f16b5b86000..0x00007f16b5b87af0]
java.lang.Thread.State: WAITING (on object monitor)
@ -444,7 +444,7 @@ hadoop 17789 155 35.2 9067824 8604364 ? S&lt;l Mar04 9855:48 /usr/java/j
</para>
<para>
A thread that receives data from HDFS:
<programlisting>
"IPC Client (47) connection to sv4borg9/10.4.24.40:9000 from hadoop" daemon prio=10 tid=0x00007f16a02d0000 nid=0x4fa3 runnable [0x00007f16b517d000..0x00007f16b517dbf0]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
@ -509,27 +509,27 @@ hadoop 17789 155 35.2 9067824 8604364 ? S&lt;l Mar04 9855:48 /usr/java/j
</section>
<section xml:id="trouble.tools.clustersshtop">
<title>clusterssh+top</title>
<para>
clusterssh+top is like a poor man's monitoring system, and it can be quite useful when you have only a few machines, as it's very easy to set up. Starting clusterssh will give you one terminal per machine and another terminal in which whatever you type will be retyped in every window. This means that you can type “top” once and it will start it for all of your machines at the same time, giving you a full view of the current state of your cluster. You can also tail all the logs at the same time, edit files, etc.
</para>
</section>
</section>
</section>
<section xml:id="trouble.client">
<title>Client</title>
<para>For more information on the HBase client, see <xref linkend="client"/>.
</para>
<section xml:id="trouble.client.scantimeout">
<title>ScannerTimeoutException or UnknownScannerException</title>
<para>This is thrown if the time between RPC calls from the client to the RegionServer exceeds the scan timeout.
For example, if <code>Scan.setCaching</code> is set to 500, then there will be an RPC call to fetch the next batch of rows every 500 <code>.next()</code> calls on the ResultScanner,
because data is being transferred in blocks of 500 rows to the client. Reducing the setCaching value may be an option, but setting this value too low makes for inefficient
processing of large numbers of rows.
</para>
<para>See <xref linkend="perf.hbase.client.caching"/>.
</para>
</section>
<section xml:id="trouble.client.lease.exception">
<title><classname>LeaseException</classname> when calling <classname>Scanner.next</classname></title>
<para>
@ -546,27 +546,27 @@ Harsh J investigated the issue as part of the mailing list thread
<title>Shell or client application throws lots of scary exceptions during normal operation</title>
<para>Since 0.20.0 the default log level for <code>org.apache.hadoop.hbase.*</code> is DEBUG.</para>
<para>
On your clients, edit <filename>$HBASE_HOME/conf/log4j.properties</filename> and change this: <code>log4j.logger.org.apache.hadoop.hbase=DEBUG</code> to this: <code>log4j.logger.org.apache.hadoop.hbase=INFO</code>, or even <code>log4j.logger.org.apache.hadoop.hbase=WARN</code>.
</para>
</section>
<section xml:id="trouble.client.longpauseswithcompression">
<title>Long Client Pauses With Compression</title>
<para>This is a fairly frequent question on the Apache HBase dist-list. The scenario is that a client is typically inserting a lot of data into a
relatively un-optimized HBase cluster. Compression can exacerbate the pauses, although it is not the source of the problem.</para>
<para>See <xref linkend="precreate.regions"/> on the pattern for pre-creating regions and confirm that the table isn't starting with a single region.</para>
<para>See <xref linkend="perf.configurations"/> for cluster configuration, particularly <code>hbase.hstore.blockingStoreFiles</code>, <code>hbase.hregion.memstore.block.multiplier</code>,
<code>MAX_FILESIZE</code> (region size), and <code>MEMSTORE_FLUSHSIZE</code>.</para>
<para>A slightly longer explanation of why pauses can happen is as follows: Puts are sometimes blocked on the MemStores, which are blocked by the flusher thread, which is blocked because there are
too many files to compact, because the compactor is given too many small files to compact and has to compact the same data repeatedly. This situation can occur even with minor compactions.
Compounding this situation, Apache HBase doesn't compress data in memory. Thus, the 64MB that lives in the MemStore could become a 6MB file after compression, which results in a smaller StoreFile. The upside is that
more data is packed into the same region, but performance is achieved by being able to write larger files, which is why HBase waits until the flush size before writing a new StoreFile. And smaller StoreFiles
become targets for compaction. Without compression the files are much bigger and don't need as much compaction; however, this is at the expense of I/O.
</para>
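The arithmetic behind "smaller StoreFiles become targets for compaction" can be sketched as follows. This is an illustration with assumed numbers (a 1 GB region, a 64 MB flush size, 10:1 compression), not HBase code:

```python
# Rough illustration: the memstore flushes at its configured (uncompressed)
# flush size, so heavier compression means smaller on-disk StoreFiles and
# many more of them accumulate before the region reaches its maximum size,
# giving the compactor more small files to churn through.
def storefiles_per_region(region_mb, flush_mb, compression_ratio):
    """Approximate number of flush-sized StoreFiles written to fill a region."""
    file_mb = flush_mb / compression_ratio  # on-disk size of one flushed file
    return region_mb / file_mb

uncompressed = storefiles_per_region(1024, 64, 1)   # 16 files of 64 MB each
compressed = storefiles_per_region(1024, 64, 10)    # 160 files of ~6.4 MB each
print(round(uncompressed), round(compressed))  # 16 160
```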
<para>
For additional information, see this thread on <link xlink:href="http://search-hadoop.com/m/WUnLM6ojHm1/Long+client+pauses+with+compression&amp;subj=Long+client+pauses+with+compression">Long client pauses with compression</link>.
</para>
</section>
<section xml:id="trouble.client.zookeeper">
<title>ZooKeeper Client Connection Errors</title>
<para>Errors like this...
@ -588,11 +588,11 @@ Harsh J investigated the issue as part of the mailing list thread
11/07/05 11:26:45 INFO zookeeper.ClientCnxn: Opening socket connection to
server localhost/127.0.0.1:2181
</programlisting>
... are either due to ZooKeeper being down, or unreachable due to network issues.
</para>
<para>The utility <xref linkend="trouble.tools.builtin.zkcli"/> may help investigate ZooKeeper issues.
</para>
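<para>When triaging these errors, it also helps to confirm which quorum the client is actually configured against. A minimal sketch (the sample file below is hypothetical; point the grep at the <filename>hbase-site.xml</filename> on your client's classpath):</para>

```shell
# Create a sample hbase-site.xml purely for illustration.
cat > /tmp/hbase-site-sample.xml <<'EOF'
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
</configuration>
EOF
# Extract the quorum value so each host can be checked for reachability.
grep -A1 'hbase.zookeeper.quorum' /tmp/hbase-site-sample.xml \
  | grep -o '<value>[^<]*</value>' | sed 's/<[^>]*>//g'
```

<para>If the printed hosts are wrong, unresolvable, or unreachable from the client machine, that explains the connection errors above.</para>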
</section>
<section xml:id="trouble.client.oome.directmemory.leak">
<title>Client running out of memory though heap size seems to be stable (but the off-heap/direct heap keeps growing)</title>
<para>
server-side off-heap cache since this feature depends on being able to use big direct buffers (You may have to keep
separate client-side and server-side config dirs).
</para>
</section>
<section xml:id="trouble.client.slowdown.admin">
<title>Client Slowdown When Calling Admin Methods (flush, compact, etc.)</title>
<para>
This is a client issue fixed by <link xlink:href="https://issues.apache.org/jira/browse/HBASE-5073">HBASE-5073</link> in 0.90.6.
There was a ZooKeeper leak in the client and the client was getting pummeled by ZooKeeper events with each additional
invocation of the admin API.
</para>
</section>
<section xml:id="trouble.client.security.rpc">
<title>Secure Client Cannot Connect ([Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)])</title>
<para>
You may also need to download the <link xlink:href="http://www.oracle.com/technetwork/java/javase/downloads/jce-6-download-429243.html">unlimited strength JCE policy files</link>. Uncompress and extract the downloaded file, and install the policy jars into &lt;java-home&gt;/lib/security.
</para>
</section>
</section>
<section xml:id="trouble.mapreduce">
<title>MapReduce</title>
<section xml:id="trouble.mapreduce.local">
<title>You Think You're On The Cluster, But You're Actually Local</title>
<para>The following stacktrace happened using <code>ImportTsv</code>, but things like this
can happen on any job with a mis-configuration.
<programlisting>
WARN mapred.LocalJobRunner: job_local_0001
java.lang.IllegalArgumentException: Can't read partitions file
Caused by: java.io.FileNotFoundException: File _partition.lst does not exist.
</programlisting>
LocalJobRunner means the job is running locally, not on the cluster.
</para>
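<para>One quick sanity check: LocalJobRunner is chosen when <code>mapred.job.tracker</code> resolves to <code>local</code> (the Hadoop default). A minimal sketch, using a sample config file rather than a real deployment:</para>

```shell
# Sample mapred-site.xml for illustration; in practice inspect the one
# on the submitting client's classpath.
cat > /tmp/mapred-site-sample.xml <<'EOF'
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>
</configuration>
EOF
tracker=$(grep -A1 'mapred.job.tracker' /tmp/mapred-site-sample.xml \
  | grep -o '<value>[^<]*</value>' | sed 's/<[^>]*>//g')
if [ "$tracker" = "local" ]; then
  echo "jobs will run in LocalJobRunner, not on the cluster"
fi
```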
<para>See
<link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#classpath">
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#classpath</link> for more
information on HBase MapReduce jobs and classpaths.
</para>
</section>
</section>
<section xml:id="trouble.namenode">
<title>NameNode</title>
<para>For more information on the NameNode, see <xref linkend="arch.hdfs"/>.
</para>
<section xml:id="trouble.namenode.disk">
<title>HDFS Utilization of Tables and Regions</title>
<para><programlisting>hadoop fs -du /hbase/myTable</programlisting> ...returns a list of the regions under the HBase table 'myTable' and their disk utilization. </para>
<para>For more information on HDFS shell commands, see the <link xlink:href="http://hadoop.apache.org/common/docs/current/file_system_shell.html">HDFS FileSystem Shell documentation</link>.
</para>
</section>
<section xml:id="trouble.namenode.hbase.objects">
<title>Browsing HDFS for HBase Objects</title>
<para>Sometimes it will be necessary to explore the HBase objects that exist on HDFS. These objects could include the WALs (Write Ahead Logs), tables, regions, StoreFiles, etc.
<para>The HDFS directory structure of HBase WAL is:
<programlisting>
<filename>/hbase</filename>
<filename>/.logs</filename>
<filename>/&lt;RegionServer&gt;</filename> (RegionServers)
<filename>/&lt;HLog&gt;</filename> (WAL HLog files for the RegionServer)
</programlisting>
</para>
<para>See the <link xlink:href="http://hadoop.apache.org/common/docs/current/hdfs_user_guide.html">HDFS User Guide</link> for other non-shell diagnostic
utilities like <code>fsck</code>.
</para>
<section xml:id="trouble.namenode.uncompaction">
<title>Use Cases</title>
<para>Two common use-cases for querying HDFS for HBase objects are researching the degree of uncompaction of a table and evaluating the result of a major compaction. If there are a large number of StoreFiles for each ColumnFamily it could
indicate the need for a major compaction. Additionally, after a major compaction, if the resulting StoreFile is "small" it could indicate the need for a reduction of ColumnFamilies for
the table.
</para>
</section>
</section>
</section>
<section xml:id="trouble.network">
<title>Network</title>
<section xml:id="trouble.network.spikes">
<title>Network Spikes</title>
<para>If you are seeing periodic network spikes you might want to check the <code>compactionQueues</code> to see if major
compactions are happening.
</para>
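<para>One way to correlate spikes with compactions is to count compaction activity in the RegionServer log around the spike window. A rough sketch (the sample log lines are illustrative; exact message phrasing varies across HBase versions):</para>

```shell
# Sample RegionServer log excerpt, for illustration only.
cat > /tmp/rs-sample.log <<'EOF'
2013-01-10 02:00:01 INFO regionserver.Store: Starting compaction of 4 file(s)
2013-01-10 02:03:17 INFO regionserver.Store: Completed major compaction of 4 file(s)
2013-01-10 02:10:44 INFO regionserver.HRegionServer: unrelated activity
EOF
# Count lines mentioning compaction, case-insensitively.
grep -ci 'compaction' /tmp/rs-sample.log
```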
<para>See <xref linkend="managed.compactions"/> for more information on managing compactions.
<para>Are all the network interfaces functioning correctly? Are you sure? See the Troubleshooting Case Study in <xref linkend="trouble.casestudy"/>.
</para>
</section>
</section>
<section xml:id="trouble.rs">
<title>RegionServer</title>
<para>For more information on the RegionServers, see <xref linkend="regionserver.arch"/>.
</para>
<section xml:id="trouble.rs.startup">
<title>Startup Errors</title>
<section xml:id="trouble.rs.startup.master">
<title>Master Starts, But RegionServers Do Not</title>
<para>The Master believes the RegionServers have the IP of 127.0.0.1 - which is localhost and resolves to the master's own localhost.
</para>
<para>The RegionServers are erroneously informing the Master that their IP addresses are 127.0.0.1.
</para>
<para>Modify <filename>/etc/hosts</filename> on the region servers, from...
<programlisting>
# Do not remove the following line, or various programs
# that require network functionality will fail.
</programlisting>
</para>
</section>
<section xml:id="trouble.rs.startup.compression">
<title>Compression Link Errors</title>
<para>
<programlisting>
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
</programlisting>
...then there is a path issue with the compression libraries. See the Configuration section on <link linkend="lzo.compression">LZO compression configuration</link>.
</para>
</section>
</section>
<section xml:id="trouble.rs.runtime">
<title>Runtime Errors</title>
Adding <code>-XX:+UseMembar</code> to the HBase <varname>HBASE_OPTS</varname> in <filename>conf/hbase-env.sh</filename>
may fix it.
</para>
<para>Also, are you using <xref linkend="client.rowlocks"/>? These are discouraged because they can lock up the
RegionServers if not managed properly.
</para>
</section>
<para>
If you see log messages like this...
<programlisting>
2010-09-13 01:24:17,336 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
Disk-related IOException in BlockReceiver constructor. Cause is java.io.IOException: Too many open files
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.createNewFile(File.java:883)
</programlisting>
<programlisting>
2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept xxx ms, ten times longer than scheduled: 10000
2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept xxx ms, ten times longer than scheduled: 15000
2009-02-24 10:01:36,472 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master for xxx milliseconds - retrying
</programlisting>
... or if you see full GC messages, then you may be experiencing full GCs.
</para>
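<para>A quick way to spot these pauses after the fact is to scan the RegionServer log for the Sleeper warnings shown above. A minimal sketch using a sample log file:</para>

```shell
# Sample log excerpt, for illustration only.
cat > /tmp/rs-gc-sample.log <<'EOF'
2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 18000ms, ten times longer than scheduled: 1000
2009-02-24 10:01:36,472 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master for 18000 milliseconds - retrying
2009-02-24 10:02:00,000 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: normal operation
EOF
# Each "We slept" warning is a candidate GC (or scheduling) pause.
grep -c 'We slept' /tmp/rs-gc-sample.log
```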
<programlisting>
java.io.IOException: Session Expired
at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
</programlisting>
<para>
The JVM is doing a long-running garbage collection which is pausing every thread (aka "stop the world").
Since the RegionServer's local ZooKeeper client cannot send heartbeats, the session times out.
By design, we shut down any node that isn't able to contact the ZooKeeper ensemble after getting a timeout so that it stops serving data that may already be assigned elsewhere.
</para>
<para>
<itemizedlist>
<listitem>Make sure you are not CPU starving the RegionServer thread. For example, if you are running a MapReduce job using 6 CPU-intensive tasks on a machine with 4 cores, you are probably starving the RegionServer enough to create longer garbage collection pauses.</listitem>
<listitem>Increase the ZooKeeper session timeout</listitem>
</itemizedlist>
If you wish to increase the session timeout, add the following to your <filename>hbase-site.xml</filename> to increase the timeout from the default of 60 seconds to 120 seconds.
<programlisting>
&lt;property&gt;
&lt;name&gt;zookeeper.session.timeout&lt;/name&gt;
&lt;value&gt;120000&lt;/value&gt;
&lt;/property&gt;
</programlisting>
</para>
<para>
Be aware that setting a higher timeout means that the regions served by a failed RegionServer will take at least
that amount of time to be transferred to another RegionServer. For a production system serving live requests, we would instead
recommend setting it lower than 1 minute and over-provisioning your cluster in order to lower the memory load on each machine (hence having
less garbage to collect per machine).
</para>
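<para>Since <varname>zookeeper.session.timeout</varname> is specified in milliseconds, the trade-off above can be stated numerically; the configured value is also the floor on how long a failed RegionServer's regions stay unserved:</para>

```shell
# zookeeper.session.timeout is in milliseconds.
timeout_seconds=120
timeout_ms=$((timeout_seconds * 1000))
# Regions of a dead RegionServer wait at least this long before
# the Master can even begin reassigning them.
echo "a ${timeout_seconds}s timeout means a configured value of ${timeout_ms}"
```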
<para>
</para>
</section>
</section>
<section xml:id="trouble.rs.shutdown">
<title>Shutdown Errors</title>
</section>
</section>
<section xml:id="trouble.master">
<title>Master</title>
<para>For more information on the Master, see <xref linkend="master"/>.
</para>
<section xml:id="trouble.master.startup">
<title>Startup Errors</title>
<section xml:id="trouble.master.startup.migration">
<title>Master says that you need to run the hbase migrations script</title>
<para>Upon running that, the hbase migrations script says no files in root directory.</para>
<para>HBase expects the root directory to either not exist, or to have already been initialized by HBase running a previous time. If you create a new directory for HBase using Hadoop DFS, this error will occur.
Make sure the HBase root directory does not currently exist or has been initialized by a previous run of HBase. A sure-fire solution is to use Hadoop dfs to delete the HBase root and let HBase create and initialize the directory itself.
</para>
</section>
<section xml:id="trouble.master.startup.zk.buffer">
<title>Packet len6080218 is out of range!</title>
<para>If you have many regions on your cluster and you see an error
like the one reported above in this section's title in your logs, see
<link xlink:href="https://issues.apache.org/jira/browse/HBASE-4246">HBASE-4246 Cluster with too many regions cannot withstand some master failover scenarios</link>.</para>
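<para>For context on the number in this section's title: ZooKeeper refuses any packet larger than its <code>jute.maxbuffer</code> limit, which defaults to 0xfffff bytes (roughly 1MB), and a cluster with many regions can produce znode payloads well beyond that. A quick check of the arithmetic:</para>

```shell
# ZooKeeper's jute.maxbuffer defaults to 0xfffff bytes (~1MB).
default_limit=$(printf '%d' 0xfffff)
packet_len=6080218   # the length from this section's title
echo "limit=${default_limit} packet=${packet_len}"
if [ "$packet_len" -gt "$default_limit" ]; then
  echo "packet exceeds the default buffer, hence 'out of range'"
fi
```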
</section>
</section>
<section xml:id="trouble.master.shutdown">
<title>Shutdown Errors</title>
</section>
</section>
<section xml:id="trouble.zookeeper">
<title>ZooKeeper</title>
<section xml:id="trouble.zookeeper.startup">
<title>Startup Errors</title>
<section xml:id="trouble.zookeeper.startup.address">
<title>Could not find my address: xyz in list of ZooKeeper quorum servers</title>
<para>A ZooKeeper server wasn't able to start and it throws that error; xyz is the name of your server.</para>
<para>This is a name lookup problem. HBase tries to start a ZooKeeper server on some machine but that machine isn't able to find itself in the <varname>hbase.zookeeper.quorum</varname> configuration.
</para>
<para>Use the hostname presented in the error message instead of the value you used. If you have a DNS server, you can set <varname>hbase.zookeeper.dns.interface</varname> and <varname>hbase.zookeeper.dns.nameserver</varname> in <filename>hbase-site.xml</filename> to make sure it resolves to the correct FQDN.
</para>
</section>
</section>
<section xml:id="trouble.zookeeper.general">
<title>ZooKeeper, The Cluster Canary</title>
<para>ZooKeeper is the cluster's "canary in the mineshaft". It will be the first to notice issues, if any, so making sure it's happy is the shortcut to a humming cluster.
</para>
<para>
See the <link xlink:href="http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting">ZooKeeper Operating Environment Troubleshooting</link> page. It has suggestions and tools for checking disk and networking performance; i.e. the operating environment your ZooKeeper and HBase are running in.
</para>
<para>Additionally, the utility <xref linkend="trouble.tools.builtin.zkcli"/> may help investigate ZooKeeper issues.
</para>
</section>
</section>
<section xml:id="trouble.ec2">
<title>Amazon EC2</title>
<section xml:id="trouble.ec2.zookeeper">
<title>ZooKeeper does not seem to work on Amazon EC2</title>
<para>HBase does not start when deployed as Amazon EC2 instances. Exceptions like the below appear in the Master and/or RegionServer logs: </para>
<programlisting>
java.net.ConnectException: Connection refused
</programlisting>
<para>
Security group policy is blocking the ZooKeeper port on a public address.
Use the internal EC2 host names when configuring the ZooKeeper quorum peer list.
</para>
</section>
<section xml:id="trouble.ec2.instability">
<title>Instability on Amazon EC2</title>
<para>
See Andrew's answer here, up on the user list: <link xlink:href="http://search-hadoop.com/m/sPdqNFAwyg2">Remote Java client connection into EC2 instance</link>.
</para>
</section>
</section>
<section xml:id="trouble.versions">
<title>HBase and Hadoop version issues</title>
<section xml:id="trouble.versions.205">
<title><code>NoClassDefFoundError</code> when trying to run 0.90.x on hadoop-0.20.205.x (or hadoop-1.0.x)</title>
<para>Apache HBase 0.90.x does not ship with hadoop-0.20.205.x, etc. To make it run, you need to replace the hadoop
...</computeroutput>
...are you trying to talk to a Hadoop 2.0.x cluster from an HBase that has a Hadoop 1.0.x client?
Use the HBase built against Hadoop 2.0 or rebuild your HBase passing the <command>-Dhadoop.profile=2.0</command>
attribute to Maven (See <xref linkend="maven.build.hadoop" /> for more).
</para>
</section>
</section>
<section xml:id="trouble.casestudies">
<title>Case Studies</title>
<para>For Performance and Troubleshooting Case Studies, see <xref linkend="casestudies"/>.
</para>
</section>
</chapter>