diff --git a/src/main/docbkx/configuration.xml b/src/main/docbkx/configuration.xml
index d999e93631e..ee64ed14bf7 100644
--- a/src/main/docbkx/configuration.xml
+++ b/src/main/docbkx/configuration.xml
@@ -1216,11 +1216,65 @@ of all regions.
 Better Mean Time to Recover (MTTR)
- See the Deveraj Das an Nicolas Liochon blog post
+ This section is about configurations that will make servers come back faster after a failure.
+ See the Devaraj Das and Nicolas Liochon blog post
 Introduction to HBase Mean Time to Recover (MTTR)
- for a brief introduction. The issue HBASE-8354 forces Namenode into loop with lease recovery requests
+ for a brief introduction.
+ The issue HBASE-8354 forces Namenode into loop with lease recovery requests
 is messy but has a bunch of good discussion toward the end on low timeouts and how to effect faster recovery including citation of fixes
- added to HDFS. Read the Varun Sharma comments.
+ added to HDFS. Read the Varun Sharma comments. The suggested configurations below are Varun's suggestions distilled and tested. Make sure you are
+ running on a late-version HDFS so you have the fixes he refers to and himself adds to HDFS that help HBase MTTR
+ (e.g. HDFS-3703, HDFS-3712, and HDFS-4791 -- hadoop 2 for sure has them and late hadoop 1 has some).
+ Set the following in the RegionServer.
+
+ hbase.lease.recovery.dfs.timeout
+ 23000
+ How much time we allow to elapse between calls to recover the lease.
+ Should be larger than the dfs timeout.
+
+
+ dfs.client.socket-timeout
+ 10000
+ Lower the DFS timeout from 60 to 10 seconds.
+]]>
+And on the namenode/datanode side, set the following to enable 'staleness' introduced in HDFS-3703, HDFS-3912.
+
+ dfs.client.socket-timeout
+ 10000
+ Lower the DFS timeout from 60 to 10 seconds.
+
+
+ dfs.datanode.socket.write.timeout
+ 10000
+ Lower the DFS timeout from 8 * 60 seconds to 10 seconds.
+
+
+ ipc.client.connect.timeout
+ 3000
+ Down from 60 seconds to 3.
+
+
+ ipc.client.connect.max.retries.on.timeouts
+ 2
+ Down from the default of 45 to 3 (2 == 3 retries).
+
+
+ dfs.namenode.avoid.read.stale.datanode
+ true
+ Enable stale state in hdfs
+
+
+ dfs.namenode.stale.datanode.interval
+ 20000
+ Down from default 30 seconds
+
+
+ dfs.namenode.avoid.write.stale.datanode
+ true
+ Enable stale state in hdfs
+]]>
+
+
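
The property listings in this patch appear here with their XML markup stripped, so a short sketch of how the two RegionServer-side settings might look once written out may help. This is an illustration only, not text from the patch: it assumes the standard Hadoop-style `<property>` layout, assumes the settings go in `hbase-site.xml`, and assumes the values are in milliseconds. The names and values are the ones listed above.

```xml
<!-- Sketch only: assumed to live in hbase-site.xml on each RegionServer. -->
<configuration>
  <property>
    <name>hbase.lease.recovery.dfs.timeout</name>
    <!-- 23000 ms; kept larger than the DFS socket timeout below -->
    <value>23000</value>
    <description>How much time we allow to elapse between calls to recover the lease.
    Should be larger than the dfs timeout.</description>
  </property>
  <property>
    <name>dfs.client.socket-timeout</name>
    <!-- 10000 ms, i.e. 10 seconds instead of the 60-second default -->
    <value>10000</value>
    <description>Lower the DFS timeout from 60 to 10 seconds.</description>
  </property>
</configuration>
```

The namenode/datanode-side properties would follow the same `<property>` pattern in the HDFS configuration file (assumed here to be `hdfs-site.xml`).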