More on bad disk handling

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1417657 13f79535-47bb-0310-9956-ffa450edef68
Michael Stack 2012-12-05 21:31:57 +00:00
parent a3500ea1c8
commit 39a0f56d47
1 changed file with 8 additions and 1 deletion


@@ -387,11 +387,18 @@ false
to go down spewing errors in <filename>dmesg</filename> -- or for some reason, run much slower than their
companions. In this case you want to decommission the disk. You have two options. You can
<link xlink:href="http://wiki.apache.org/hadoop/FAQ#I_want_to_make_a_large_cluster_smaller_by_taking_out_a_bunch_of_nodes_simultaneously._How_can_this_be_done.3F">decommission the datanode</link>
-or, less disruptive in that only the bad disks data will be rereplicated, is that you can stop the datanode,
+or, less disruptively in that only the bad disk's data will be re-replicated, you can stop the datanode,
unmount the bad volume (you can't umount a volume while the datanode is using it), and then restart the
datanode (presuming you have set dfs.datanode.failed.volumes.tolerated > 0). The regionserver will
throw some errors in its logs as it recalibrates where to get its data from -- it will likely
roll its WAL too -- but in general, apart from some latency spikes, it should keep on chugging.
+<note>
+<para>If you are doing short-circuit reads, you will have to move the regions off the regionserver
+before you stop the datanode; with short-circuit reads, even though the blocks are chmod'd so the
+regionserver cannot access them, because it already has the files open it will be able to keep
+reading the file blocks from the bad disk even though the datanode is down. Move the regions back
+after you restart the datanode.</para>
+</note>
</para>
</section>
</section>
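
For concreteness, a minimal sketch of the decommission route from the linked Hadoop FAQ, assuming a Hadoop 1.x-era setup where hdfs-site.xml on the namenode already points dfs.hosts.exclude at an exclude file; the hostname and file path are illustrative:

    # On the namenode host: add the datanode to the exclude file,
    # then tell the namenode to re-read it.
    echo "baddatanode.example.com" >> /etc/hadoop/conf/dfs.exclude
    hadoop dfsadmin -refreshNodes
    # Watch the namenode web UI until the node shows "Decommissioned";
    # its blocks get re-replicated elsewhere before it is marked done.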
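
A sketch of the less disruptive stop/unmount/restart dance, again assuming a Hadoop 1.x layout; the mount point and script paths are illustrative, and dfs.datanode.failed.volumes.tolerated must already be 1 or more in hdfs-site.xml or the datanode will refuse to start with a volume missing:

    # On the datanode host with the bad disk.
    $HADOOP_HOME/bin/hadoop-daemon.sh stop datanode
    # umount fails with "device is busy" while the datanode holds it open,
    # which is why the daemon has to come down first.
    umount /data/3
    $HADOOP_HOME/bin/hadoop-daemon.sh start datanode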
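
And if short-circuit reads are enabled, a sketch of emptying the colocated regionserver first, using the region_mover.rb script that ships in the HBase bin directory; the hostname is illustrative:

    # Unload regions from the regionserver on the affected host...
    ./bin/hbase org.jruby.Main bin/region_mover.rb unload badhost.example.com
    # ...stop the datanode, umount the bad volume, restart the datanode...
    # ...then move the regions back.
    ./bin/hbase org.jruby.Main bin/region_mover.rb load badhost.example.com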