From 26a45480474d69586583ef72dfeff9f183bba522 Mon Sep 17 00:00:00 2001 From: Doug Meil Date: Fri, 26 Aug 2011 20:45:51 +0000 Subject: [PATCH] HBASE-4262 adding ops_mgt.xml git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1162245 13f79535-47bb-0310-9956-ffa450edef68 --- src/docbkx/book.xml | 213 +----------------------------------- src/docbkx/ops_mgt.xml | 237 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 239 insertions(+), 211 deletions(-) create mode 100644 src/docbkx/ops_mgt.xml diff --git a/src/docbkx/book.xml b/src/docbkx/book.xml index 433ed17c50d..7b5f6d0b3fa 100644 --- a/src/docbkx/book.xml +++ b/src/docbkx/book.xml @@ -1403,216 +1403,7 @@ HTable table2 = new HTable(conf2, "myTable"); - - - Tools - - Here we list HBase tools for administration, analysis, fixup, and - debugging. -
- HBase <application>hbck</application> - An fsck for your HBase install - To run hbck against your HBase cluster run - $ ./bin/hbase hbck - At the end of the commands output it prints OK - or INCONSISTENCY. If your cluster reports - inconsistencies, pass -details to see more detail emitted. - If inconsistencies, run hbck a few times because the - inconsistency may be transient (e.g. cluster is starting up or a region is - splitting). - Passing -fix may correct the inconsistency (This latter - is an experimental feature). - -
-
HFile Tool - See . -
-
- WAL Tools - -
- <classname>HLog</classname> tool - - The main method on HLog offers manual - split and dump facilities. Pass it WALs or the product of a split, the - content of the recovered.edits. directory. - - You can get a textual dump of a WAL file content by doing the - following: $ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --dump hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012 The - return code will be non-zero if issues with the file so you can test - wholesomeness of file by redirecting STDOUT to - /dev/null and testing the program return. - - Similarily you can force a split of a log file directory by - doing: $ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --split hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/ -
-
-
Compression Tool - See . -
-
Node Decommission - You can stop an individual RegionServer by running the following - script in the HBase directory on the particular node: - $ ./bin/hbase-daemon.sh stop regionserver - The RegionServer will first close all regions and then shut itself down. - On shutdown, the RegionServer's ephemeral node in ZooKeeper will expire. - The master will notice the RegionServer gone and will treat it as - a 'crashed' server; it will reassign the nodes the RegionServer was carrying. - Disable the Load Balancer before Decommissioning a node - If the load balancer runs while a node is shutting down, then - there could be contention between the Load Balancer and the - Master's recovery of the just decommissioned RegionServer. - Avoid any problems by disabling the balancer first. - See below. - - - - - A downside to the above stop of a RegionServer is that regions could be offline for - a good period of time. Regions are closed in order. If many regions on the server, the - first region to close may not be back online until all regions close and after the master - notices the RegionServer's znode gone. In HBase 0.90.2, we added facility for having - a node gradually shed its load and then shutdown itself down. HBase 0.90.2 added the - graceful_stop.sh script. Here is its usage: - $ ./bin/graceful_stop.sh -Usage: graceful_stop.sh [--config &conf-dir>] [--restart] [--reload] [--thrift] [--rest] &hostname> - thrift If we should stop/start thrift before/after the hbase stop/start - rest If we should stop/start rest before/after the hbase stop/start - restart If we should restart after graceful stop - reload Move offloaded regions back on to the stopped server - debug Move offloaded regions back on to the stopped server - hostname Hostname of server we are to stop - - - To decommission a loaded RegionServer, run the following: - $ ./bin/graceful_stop.sh HOSTNAME - where HOSTNAME is the host carrying the RegionServer - you would decommission. - On <varname>HOSTNAME</varname> - The HOSTNAME passed to graceful_stop.sh - must match the hostname that hbase is using to identify RegionServers. - Check the list of RegionServers in the master UI for how HBase is - referring to servers. Its usually hostname but can also be FQDN. - Whatever HBase is using, this is what you should pass the - graceful_stop.sh decommission - script. If you pass IPs, the script is not yet smart enough to make - a hostname (or FQDN) of it and so it will fail when it checks if server is - currently running; the graceful unloading of regions will not run. - - The graceful_stop.sh script will move the regions off the - decommissioned RegionServer one at a time to minimize region churn. - It will verify the region deployed in the new location before it - will moves the next region and so on until the decommissioned server - is carrying zero regions. At this point, the graceful_stop.sh - tells the RegionServer stop. The master will at this point notice the - RegionServer gone but all regions will have already been redeployed - and because the RegionServer went down cleanly, there will be no - WAL logs to split. - Load Balancer - - It is assumed that the Region Load Balancer is disabled while the - graceful_stop script runs (otherwise the balancer - and the decommission script will end up fighting over region deployments). - Use the shell to disable the balancer: - hbase(main):001:0> balance_switch false -true -0 row(s) in 0.3590 seconds -This turns the balancer OFF. 
To reenable, do: - hbase(main):001:0> balance_switch true -false -0 row(s) in 0.3590 seconds - - - -
- Rolling Restart - - You can also ask this script to restart a RegionServer after the shutdown - AND move its old regions back into place. The latter you might do to - retain data locality. A primitive rolling restart might be effected by - running something like the following: - $ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt & - - Tail the output of /tmp/log.txt to follow the scripts - progress. The above does RegionServers only. Be sure to disable the - load balancer before doing the above. You'd need to do the master - update separately. Do it before you run the above script. - Here is a pseudo-script for how you might craft a rolling restart script: - - Untar your release, make sure of its configuration and - then rsync it across the cluster. If this is 0.90.2, patch it - with HBASE-3744 and HBASE-3756. - - - - Run hbck to ensure the cluster consistent - $ ./bin/hbase hbck - Effect repairs if inconsistent. - - - - Restart the Master: $ ./bin/hbase-daemon.sh stop master; ./bin/hbase-daemon.sh start master - - - - - Disable the region balancer:$ echo "balance_switch false" | ./bin/hbase shell - - - - Run the graceful_stop.sh script per RegionServer. For example: - $ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt & - - If you are running thrift or rest servers on the RegionServer, pass --thrift or --rest options (See usage - for graceful_stop.sh script). - - - - Restart the Master again. This will clear out dead servers list and reenable the balancer. - - - - Run hbck to ensure the cluster is consistent. - - - - -
- -
-
- CopyTable - - CopyTable is a utility that can copy part or of all of a table, either to the same cluster or another cluster. The usage is as follows: -$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable [--rs.class=CLASS] [--rs.impl=IMPL] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] tablename - - - - Options: - - rs.class hbase.regionserver.class of the peer cluster. Specify if different from current cluster. - rs.impl hbase.regionserver.impl of the peer cluster. - starttime Beginning of the time range. Without endtime means starttime to forever. - endtime End of the time range. Without endtime means starttime to forever. - new.name New table's name. - peer.adr Address of the peer cluster given in the format hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent - families Comma-separated list of ColumnFamilies to copy. - - Args: - - tablename Name of table to copy. - - - Example of copying 'TestTable' to a cluster that uses replication for a 1 hour window: -$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable ---rs.class=org.apache.hadoop.hbase.ipc.ReplicationRegionInterface ---rs.impl=org.apache.hadoop.hbase.regionserver.replication.ReplicationRegionServer ---starttime=1265875194289 --endtime=1265878794289 ---peer.adr=server1,server2,server3:2181:/hbase TestTable - -
- -
+ @@ -1852,7 +1643,7 @@ When I build, why do I always get Unable to find resource 'VM_global_libra - See HBase Backup Options over on the Sematext Blog. + See diff --git a/src/docbkx/ops_mgt.xml b/src/docbkx/ops_mgt.xml new file mode 100644 index 00000000000..35819fb1672 --- /dev/null +++ b/src/docbkx/ops_mgt.xml @@ -0,0 +1,237 @@ + + + HBase Operational Management + This chapter will cover operational tools and practices required of a running HBase cluster. + The subject of operations is related to the topics of , , + and but is a distinct topic in itself. + +
+ HBase Tools and Utilities + + Here we list HBase tools for administration, analysis, fixup, and + debugging. +
HBase <application>hbck</application>
An fsck for your HBase install
To run hbck against your HBase cluster, run
$ ./bin/hbase hbck
At the end of the command's output it prints OK or INCONSISTENCY. If your cluster reports inconsistencies, pass -details to see more detail emitted. If there are inconsistencies, run hbck a few times because the inconsistency may be transient (e.g. the cluster is starting up or a region is splitting). Passing -fix may correct the inconsistency (note that -fix is an experimental feature).
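The check-and-retry advice above can be scripted. The following is a minimal sketch, not part of HBase: it keys off the OK/INCONSISTENCY summary described above, and the retry count and sleep interval are arbitrary placeholders.
#!/usr/bin/env bash
# Hypothetical helper: run hbck a few times, since a transient condition
# (cluster startup, a region split) can show up as an inconsistency.
for attempt in 1 2 3; do
  out=$(./bin/hbase hbck 2>&1)
  if ! echo "${out}" | grep -q "INCONSISTENC"; then
    echo "attempt ${attempt}: cluster reports OK"
    exit 0
  fi
  echo "attempt ${attempt}: inconsistencies reported, retrying in 60s"
  sleep 60
done
# Still inconsistent after the retries above; emit the detailed report.
./bin/hbase hbck -details
exit 1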
+
HFile Tool
See .

WAL Tools

<classname>HLog</classname> tool
The main method on HLog offers manual split and dump facilities. Pass it WALs or the product of a split, the content of the recovered.edits. directory.
You can get a textual dump of a WAL file's content by doing the following:
$ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --dump hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012
The return code will be non-zero if there are issues with the file, so you can test the wholesomeness of a file by redirecting STDOUT to /dev/null and testing the program's return code.
Similarly, you can force a split of a log file directory by doing:
$ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --split hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/
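The wholesomeness check described above can be wrapped in a small script. This is a sketch only; the WAL path is the example path from above and would be replaced with the file you want to verify.
#!/usr/bin/env bash
# Hypothetical check: dump a WAL to /dev/null and rely on the HLog tool's
# return code (non-zero indicates a problem reading the file).
WAL="hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012"
if ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --dump "${WAL}" > /dev/null; then
  echo "WAL looks wholesome: ${WAL}"
else
  echo "WAL dump returned non-zero: ${WAL}" >&2
  exit 1
fi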
+
+
Compression Tool
See .

CopyTable
CopyTable is a utility that can copy part or all of a table, either to the same cluster or to another cluster. The usage is as follows:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable [--rs.class=CLASS] [--rs.impl=IMPL] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] tablename

Options:
 rs.class     hbase.regionserver.class of the peer cluster. Specify if different from the current cluster.
 rs.impl      hbase.regionserver.impl of the peer cluster.
 starttime    Beginning of the time range. Without endtime means from starttime to forever.
 endtime      End of the time range.
 new.name     New table's name.
 peer.adr     Address of the peer cluster given in the format hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
 families     Comma-separated list of ColumnFamilies to copy.

Args:
 tablename    Name of the table to copy.

Example of copying 'TestTable' to a cluster that uses replication for a 1-hour window:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --rs.class=org.apache.hadoop.hbase.ipc.ReplicationRegionInterface --rs.impl=org.apache.hadoop.hbase.regionserver.replication.ReplicationRegionServer --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase TestTable
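As a second, hedged illustration using only the options listed above, the following sketch copies just the cf1 and cf2 column families of 'TestTable' into a table named 'TestTableCopy' on the same cluster. The table and family names are made-up placeholders, and the destination table generally needs to exist with matching column families before the copy runs.
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=TestTableCopy --families=cf1,cf2 TestTable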
+
+ +
Node Management +
Node Decommission
You can stop an individual RegionServer by running the following script in the HBase directory on the particular node:
$ ./bin/hbase-daemon.sh stop regionserver
The RegionServer will first close all regions and then shut itself down. On shutdown, the RegionServer's ephemeral node in ZooKeeper will expire. The master will notice the RegionServer is gone and will treat it as a 'crashed' server; it will reassign the regions the RegionServer was carrying.
Disable the Load Balancer before Decommissioning a node
If the load balancer runs while a node is shutting down, there could be contention between the Load Balancer and the Master's recovery of the just-decommissioned RegionServer. Avoid any problems by disabling the balancer first. See below.

A downside to the above stop of a RegionServer is that regions could be offline for a good period of time. Regions are closed in order. If there are many regions on the server, the first region to close may not be back online until all regions close and the master notices the RegionServer's znode is gone. In HBase 0.90.2, we added a facility for having a node gradually shed its load and then shut itself down: the graceful_stop.sh script. Here is its usage:
$ ./bin/graceful_stop.sh
Usage: graceful_stop.sh [--config <conf-dir>] [--restart] [--reload] [--thrift] [--rest] <hostname>
 thrift      If we should stop/start thrift before/after the hbase stop/start
 rest        If we should stop/start rest before/after the hbase stop/start
 restart     If we should restart after graceful stop
 reload      Move offloaded regions back on to the stopped server
 debug       Move offloaded regions back on to the stopped server
 hostname    Hostname of server we are to stop

To decommission a loaded RegionServer, run the following:
$ ./bin/graceful_stop.sh HOSTNAME
where HOSTNAME is the host carrying the RegionServer you would like to decommission.
On <varname>HOSTNAME</varname>
The HOSTNAME passed to graceful_stop.sh must match the hostname that HBase is using to identify RegionServers. Check the list of RegionServers in the master UI for how HBase is referring to servers. It's usually a hostname but can also be an FQDN. Whatever HBase is using, this is what you should pass to the graceful_stop.sh decommission script. If you pass IPs, the script is not yet smart enough to turn them into a hostname (or FQDN), so it will fail when it checks whether the server is currently running; the graceful unloading of regions will not run.

The graceful_stop.sh script will move the regions off the decommissioned RegionServer one at a time to minimize region churn. It will verify a region is deployed in its new location before it moves the next region, and so on, until the decommissioned server is carrying zero regions. At this point, graceful_stop.sh tells the RegionServer to stop. The master will notice the RegionServer is gone, but all regions will have already been redeployed, and because the RegionServer went down cleanly, there will be no WAL logs to split.
Load Balancer
It is assumed that the Region Load Balancer is disabled while the graceful_stop script runs (otherwise the balancer and the decommission script will end up fighting over region deployments). Use the shell to disable the balancer:
hbase(main):001:0> balance_switch false
true
0 row(s) in 0.3590 seconds
This turns the balancer OFF.
To reenable, do:
hbase(main):001:0> balance_switch true
false
0 row(s) in 0.3590 seconds
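Putting the pieces above together, decommissioning a single node typically looks like the following sketch. rs1.example.com is a placeholder hostname; substitute the name shown in the master UI, and remember to re-enable the balancer when you are done.
# Disable the region balancer so it does not fight the decommission.
$ echo "balance_switch false" | ./bin/hbase shell

# Gracefully unload and stop the RegionServer on the target host.
$ ./bin/graceful_stop.sh rs1.example.com

# Once maintenance is complete, re-enable the balancer.
$ echo "balance_switch true" | ./bin/hbase shell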
+
Rolling Restart
You can also ask this script to restart a RegionServer after the shutdown AND move its old regions back into place. The latter you might do to retain data locality. A primitive rolling restart might be effected by running something like the following:
$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
Tail the output of /tmp/log.txt to follow the script's progress. The above does RegionServers only. Be sure to disable the load balancer before doing the above. You'd need to do the master update separately, and do it before you run the above script.
Here is a pseudo-script for how you might craft a rolling restart; a runnable sketch of these steps follows the list below.
1. Untar your release, make sure of its configuration, and then rsync it across the cluster. If this is 0.90.2, patch it with HBASE-3744 and HBASE-3756.
2. Run hbck to ensure the cluster is consistent:
$ ./bin/hbase hbck
Effect repairs if inconsistent.
3. Restart the Master:
$ ./bin/hbase-daemon.sh stop master; ./bin/hbase-daemon.sh start master
4. Disable the region balancer:
$ echo "balance_switch false" | ./bin/hbase shell
5. Run the graceful_stop.sh script per RegionServer. For example:
$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
If you are running thrift or rest servers on the RegionServer, pass the --thrift or --rest options (see the usage for the graceful_stop.sh script).
6. Restart the Master again. This will clear out the dead servers list and re-enable the balancer.
7. Run hbck to ensure the cluster is consistent.
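Here is the promised sketch of the steps above as a single script. It is illustrative only: /path/to/hbase is a placeholder, the new release is assumed to have already been unpacked, configured, and rsynced (step 1), and you would add --thrift or --rest to the graceful_stop.sh invocation if those servers run on your RegionServers.
#!/usr/bin/env bash
# Hypothetical rolling-restart sketch following steps 2-7 above.
set -e
cd /path/to/hbase   # placeholder: your HBase install directory on this host

# Step 2: report cluster consistency before starting; repair by hand if needed.
./bin/hbase hbck

# Step 3: restart the Master on the new code.
./bin/hbase-daemon.sh stop master; ./bin/hbase-daemon.sh start master

# Step 4: disable the region balancer.
echo "balance_switch false" | ./bin/hbase shell

# Step 5: gracefully restart each RegionServer, moving its regions back afterwards.
for i in $(cat conf/regionservers | sort); do
  ./bin/graceful_stop.sh --restart --reload --debug "$i"
done &> /tmp/log.txt

# Step 6: restart the Master again to clear the dead servers list and
# re-enable the balancer.
./bin/hbase-daemon.sh stop master; ./bin/hbase-daemon.sh start master

# Step 7: verify the cluster is consistent.
./bin/hbase hbck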
+
+ +
+ HBase Monitoring + TODO + +
+
+ HBase Backup + See HBase Backup Options over on the Sematext Blog. + +
+ +