diff --git a/src/docbkx/ops_mgt.xml b/src/docbkx/ops_mgt.xml
index 16c19da8960..1edd5c05130 100644
--- a/src/docbkx/ops_mgt.xml
+++ b/src/docbkx/ops_mgt.xml
@@ -387,11 +387,18 @@ false
  to go down spewing errors in dmesg -- or for some reason, run much slower than their companions.
  In this case you want to decommission the disk. You have two options. You can decommission the datanode
- or, less disruptive in that only the bad disks data will be rereplicated, is that you can stop the datanode,
+ or, less disruptive in that only the bad disks data will be rereplicated, can stop the datanode,
  unmount the bad volume (You can't umount a volume while the datanode is using it), and then restart the
  datanode (presuming you have set dfs.datanode.failed.volumes.tolerated > 0). The regionserver will
  throw some errors in its logs as it recalibrates where to get its data from -- it will likely
  roll its WAL log too -- but in general but for some latency spikes, it should keep on chugging.
+
+ If you are doing short-circuit reads, you will have to move the regions off the regionserver
+ before you stop the datanode; when short-circuiting reading, though chmod'd so regionserver cannot
+ have access, because it already has the files open, it will be able to keep reading the file blocks
+ from the bad disk even though the datanode is down. Move the regions back after you restart the
+ datanode.
+