Add note on 'bad disk'
git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1415759 13f79535-47bb-0310-9956-ffa450edef68
parent 82faf90366
commit f5a3a12b24
index e70ebc6..96f8c27 100644
@@ -941,6 +941,8 @@
</section>
<section xml:id="recommended_configurations"><title>Recommended Configurations</title>
<section xml:id="recommended_configurations.zk">
<title>ZooKeeper Configuration</title>
<section xml:id="zookeeper.session.timeout"><title><varname>zookeeper.session.timeout</varname></title>
<para>The default timeout is three minutes (specified in milliseconds). This means
that if a server crashes, it will be three minutes before the Master notices
@@ -966,6 +968,18 @@
<section xml:id="zookeeper.instances"><title>Number of ZooKeeper Instances</title>
<para>See <xref linkend="zookeeper"/>.
</para>
</section>
</section>
<section xml:id="recommended.configurations.hdfs">
<title>HDFS Configurations</title>
<section xml:id="dfs.datanode.failed.volumes.tolerated">
<title>dfs.datanode.failed.volumes.tolerated</title>
<para>This is the "...number of volumes that are allowed to fail before a datanode stops offering service. By default
any volume failure will cause a datanode to shutdown" from the <filename>hdfs-default.xml</filename>
description. If you have more than three or four disks, you might want to set this to 1; if you have many disks,
set it to two or more.
</para>
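A minimal sketch of how this might look in <filename>hdfs-site.xml</filename> on the datanodes; the value of 1 here is illustrative, not a recommendation for every cluster:

```xml
<!-- hdfs-site.xml: allow one data volume to fail before the
     datanode shuts itself down (default is 0, i.e. any failure
     stops the datanode). -->
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>
```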
</section>
</section>
<section xml:id="hbase.regionserver.handler.count"><title><varname>hbase.regionserver.handler.count</varname></title>
<para>
@@ -380,6 +380,20 @@ false
</para>
</note>
</para>
<section xml:id="bad.disk">
<title>Bad or Failing Disk</title>
<para>It is good to have <xref linkend="dfs.datanode.failed.volumes.tolerated" /> set if you have a decent number of disks
per machine, for the case where a disk plain dies. But usually disks do the "John Wayne" -- i.e., take a while
to go down, spewing errors in <filename>dmesg</filename> -- or for some reason run much slower than their
companions. In this case you want to decommission the disk. You have two options. You can
<xlink href="http://wiki.apache.org/hadoop/FAQ#I_want_to_make_a_large_cluster_smaller_by_taking_out_a_bunch_of_nodes_simultaneously._How_can_this_be_done.3F">decommission the datanode</xlink>
or, less disruptive in that only the bad disk's data will be rereplicated, you can stop the datanode,
unmount the bad volume (you can't unmount a volume while the datanode is using it), and then restart the
datanode (presuming you have set dfs.datanode.failed.volumes.tolerated > 0). The regionserver will
throw some errors in its logs as it recalibrates where to get its data from -- it will likely
roll its WAL log too -- but in general, apart from some latency spikes, it should keep on chugging.
</para>
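The stop/unmount/restart sequence above might look like the following on an affected datanode. This is an illustrative sketch only: the mount point <filename>/data/3</filename> is hypothetical, and the daemon script location varies by Hadoop version and packaging.

```shell
# Illustrative only -- paths and service scripts are assumptions.
# 1. Stop the datanode so it no longer holds the bad volume open.
$HADOOP_HOME/bin/hadoop-daemon.sh stop datanode

# 2. Unmount the failing volume (hypothetical mount point).
umount /data/3

# 3. Either remove /data/3 from dfs.datanode.data.dir in
#    hdfs-site.xml, or rely on dfs.datanode.failed.volumes.tolerated
#    being > 0 so the datanode tolerates the missing volume.

# 4. Restart the datanode; only the lost volume's blocks get
#    rereplicated, rather than the whole node's.
$HADOOP_HOME/bin/hadoop-daemon.sh start datanode
```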
</section>
</section>
<section xml:id="rolling">
<title>Rolling Restart</title>