diff --git a/src/docbkx/ops_mgt.xml b/src/docbkx/ops_mgt.xml index d9b10e469fd..623bfb6037c 100644 --- a/src/docbkx/ops_mgt.xml +++ b/src/docbkx/ops_mgt.xml @@ -29,16 +29,16 @@ Apache HBase (TM) Operational Management This chapter will cover operational tools and practices required of a running Apache HBase cluster. The subject of operations is related to the topics of , , - and but is a distinct topic in itself. - + and but is a distinct topic in itself. +
HBase Tools and Utilities Here we list HBase tools for administration, analysis, fixup, and debugging.
Driver - There is a Driver class that is executed by the HBase jar can be used to invoke frequently accessed utilities. For example, -HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar + There is a Driver class, executed by the HBase jar, that can be used to invoke frequently accessed utilities. For example, +HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar ... will return... @@ -159,7 +159,7 @@ Valid program names are:
ImportTsv - ImportTsv is a utility that will load data in TSV format into HBase. It has two distinct usages: loading data from TSV format in HDFS + ImportTsv is a utility that will load data in TSV format into HBase. It has two distinct usages: loading data from TSV format in HDFS into HBase via Puts, and preparing StoreFiles to be loaded via the completebulkload. To load data via Puts (i.e., non-bulk loading): @@ -170,7 +170,7 @@ Valid program names are: $ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c -Dimporttsv.bulk.output=hdfs://storefile-outputdir <tablename> <hdfs-data-inputdir> - These generated StoreFiles can be loaded into HBase via . + These generated StoreFiles can be loaded into HBase via .
ImportTsv Options Running ImportTsv with no arguments prints brief usage information: @@ -197,7 +197,7 @@ Other options that may be specified with -D include: '-Dimporttsv.separator=|' - eg separate on pipes instead of tabs -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper - +
ImportTsv Example For example, assume that we are loading data into a table called 'datatsv' with a ColumnFamily called 'd' with two columns "c1" and "c2". @@ -229,15 +229,15 @@ row10 c1 c2
See Also For more information about bulk-loading HFiles into HBase, see -
+
- +
CompleteBulkLoad The completebulkload utility will move generated StoreFiles into an HBase table. This utility is often used - in conjunction with output from . + in conjunction with output from . - There are two ways to invoke this utility, with explicit classname and via the driver: + There are two ways to invoke this utility, with explicit classname and via the driver: $ bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <hdfs://storefileoutput> <tablename> .. and via the Driver.. @@ -266,15 +266,17 @@ row10 c1 c2
RowCounter - RowCounter is a utility that will count all the rows of a table. This is a good utility to use - as a sanity check to ensure that HBase can read all the blocks of a table if there are any concerns of metadata inconsistency. + RowCounter is a MapReduce job that counts all the rows of a table. This is a good utility to use + as a sanity check to ensure that HBase can read all the blocks of a table if there are any concerns about metadata inconsistency. + It will run the whole MapReduce job in a single process, but it will run faster if you have a MapReduce cluster in place for it to + exploit. $ bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter <tablename> [<column1> <column2>...] Note: caching for the input Scan is configured via hbase.client.scanner.caching in the job configuration.
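As a rough sketch of the note above, the caching value can be supplied as a job property on the command line. This assumes that RowCounter accepts Hadoop generic options (-D) ahead of the table name in your HBase version; the table name mytable and the value 500 are hypothetical. The command is printed rather than executed so that it can be reviewed first.

```shell
# Dry run: build and print the RowCounter invocation instead of running it.
TABLE="mytable"   # hypothetical table name
CACHING=500       # hypothetical value for hbase.client.scanner.caching
# Assumption: -D generic options are honoured when placed before the table name.
CMD="bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter -Dhbase.client.scanner.caching=${CACHING} ${TABLE}"
echo "${CMD}"
```

Larger caching values mean fewer round trips per mapper but more memory per fetch, so wide rows may call for a smaller number.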
- +
@@ -284,7 +286,7 @@ row10 c1 c2 Major compactions can be requested via the HBase shell or HBaseAdmin.majorCompact. Note: major compactions do NOT do region merges. See for more information about compactions. - +
@@ -293,16 +295,16 @@ row10 c1 c2 $ bin/hbase org.apache.hbase.util.Merge <tablename> <region1> <region2> If you feel you have too many regions and want to consolidate them, Merge is the utility you need. Merge must - run be done when the cluster is down. + be run when the cluster is down. See the O'Reilly HBase Book for an example of usage. - Additionally, there is a Ruby script attached to HBASE-1621 + Additionally, there is a Ruby script attached to HBASE-1621 for region merging.
- +
Node Management
Node Decommission You can stop an individual RegionServer by running the following @@ -328,7 +330,7 @@ row10 c1 c2 notices the RegionServer's znode gone. In Apache HBase 0.90.2, we added facility for having a node gradually shed its load and then shutdown itself down. Apache HBase 0.90.2 added the graceful_stop.sh script. Here is its usage: - $ ./bin/graceful_stop.sh + $ ./bin/graceful_stop.sh Usage: graceful_stop.sh [--config <conf-dir>] [--restart] [--reload] [--thrift] [--rest] <hostname> thrift If we should stop/start thrift before/after the hbase stop/start rest If we should stop/start rest before/after the hbase stop/start @@ -341,7 +343,7 @@ Usage: graceful_stop.sh [--config <conf-dir>] [--restart] [--reload] [--thri To decommission a loaded RegionServer, run the following: $ ./bin/graceful_stop.sh HOSTNAME where HOSTNAME is the host carrying the RegionServer - you would decommission. + you would decommission. On <varname>HOSTNAME</varname> The HOSTNAME passed to graceful_stop.sh must match the hostname that hbase is using to identify RegionServers. @@ -363,7 +365,7 @@ Usage: graceful_stop.sh [--config <conf-dir>] [--restart] [--reload] [--thri and because the RegionServer went down cleanly, there will be no WAL logs to split. Load Balancer - + It is assumed that the Region Load Balancer is disabled while the graceful_stop script runs (otherwise the balancer and the decommission script will end up fighting over region deployments). @@ -375,10 +377,10 @@ This turns the balancer OFF. To reenable, do: hbase(main):001:0> balance_switch true false 0 row(s) in 0.3590 seconds - + -
+
Rolling Restart @@ -521,33 +523,33 @@ false Overview The following metrics are arguably the most important to monitor for each RegionServer for "macro monitoring", preferably with a system like OpenTSDB. - If your cluster is having performance issues it's likely that you'll see something unusual with + If your cluster is having performance issues it's likely that you'll see something unusual with this group. - HBase: + HBase: Requests Compactions queue - - OS: + + OS: IO Wait User CPU - - Java: + + Java: GC - + For more information on HBase metrics, see .
- +
Slow Query Log The HBase slow query log consists of parseable JSON structures describing the properties of those client operations (Gets, Puts, Deletes, etc.) that either took too long to run, or produced too much output. The thresholds for "too long to run" and "too much output" are configurable, as described below. The output is produced inline in the main region server logs so that it is easy to discover further details from context with other logged events. It is also prepended with identifying tags (responseTooSlow), (responseTooLarge), (operationTooSlow), and (operationTooLarge) in order to enable easy filtering with grep, in case the user desires to see only slow queries. @@ -594,7 +596,7 @@ false
- +
Cluster Replication See Cluster Replication. @@ -602,8 +604,8 @@ false
HBase Backup - There are two broad strategies for performing HBase backups: backing up with a full cluster shutdown, and backing up on a live cluster. - Each approach has pros and cons. + There are two broad strategies for performing HBase backups: backing up with a full cluster shutdown, and backing up on a live cluster. + Each approach has pros and cons. For additional information, see HBase Backup Options over on the Sematext Blog. @@ -617,27 +619,27 @@ false
Distcp - Distcp could be used to either copy the contents of the HBase directory in HDFS to either the same cluster in another directory, or + Distcp can be used to copy the contents of the HBase directory in HDFS either to another directory on the same cluster, or to a different cluster. - Note: Distcp works in this situation because the cluster is down and there are no in-flight edits to files. + Note: Distcp works in this situation because the cluster is down and there are no in-flight edits to files. Distcp-ing of files in the HBase directory is not generally recommended on a live cluster.
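A minimal sketch of such a cold backup, assuming HBase is already stopped and that its root directory is /hbase. The NameNode address and both paths are hypothetical placeholders, not values taken from this guide. The command is printed rather than executed so that it can be checked before use.

```shell
# Dry run: print the distcp command for a full-shutdown backup.
# Run the printed command only while the HBase cluster is down.
SRC="hdfs://namenode:8020/hbase"          # hypothetical HBase root directory
DST="hdfs://namenode:8020/hbase-backup"   # hypothetical backup location
CMD="hadoop distcp ${SRC} ${DST}"
echo "${CMD}"
```

Pointing DST at a different NameNode would copy to another cluster instead of another directory on the same one.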
Restore (if needed) - The backup of the hbase directory from HDFS is copied onto the 'real' hbase directory via distcp. The act of copying these files + The backup of the hbase directory from HDFS is copied onto the 'real' hbase directory via distcp. The act of copying these files creates new HDFS metadata, which is why a restore of the NameNode edits from the time of the HBase backup isn't required for this kind of restore, because it's a restore (via distcp) of a specific HDFS directory (i.e., the HBase part) not the entire HDFS file-system.
Live Cluster Backup - Replication - This approach assumes that there is a second cluster. + This approach assumes that there is a second cluster. See the HBase page on replication for more information.
Live Cluster Backup - CopyTable - The utility could either be used to copy data from one table to another on the + The utility could either be used to copy data from one table to another on the same cluster, or to copy data to another table on another cluster. Since the cluster is up, there is a risk that edits could be missed in the copy process. @@ -658,10 +660,10 @@ false with a solid understanding of how HBase handles data internally (KeyValue).
KeyValue - HBase storage will be dominated by KeyValues. See and for - how HBase stores data internally. + HBase storage will be dominated by KeyValues. See and for + how HBase stores data internally. - It is critical to understand that there is a KeyValue instance for every attribute stored in a row, and the + It is critical to understand that there is a KeyValue instance for every attribute stored in a row, and the rowkey-length, ColumnFamily name-length and attribute lengths will drive the size of the database more than any other factor. diff --git a/src/docbkx/shell.xml b/src/docbkx/shell.xml index f341e7e6213..2a153533618 100644 --- a/src/docbkx/shell.xml +++ b/src/docbkx/shell.xml @@ -32,7 +32,7 @@ The Apache HBase (TM) Shell is (J)Ruby's IRB with some HBase particular commands added. Anything you can do in IRB, you should be able to do in the HBase Shell. - To run the HBase shell, + To run the HBase shell, do as follows: $ ./bin/hbase shell @@ -104,5 +104,16 @@
+
Commands +
count + The count command returns the number of rows in a table. + It is quite fast when configured with the right CACHE: + hbase> count '<tablename>', CACHE => 1000 + The above count fetches 1000 rows at a time. Set CACHE lower if your rows are big. + The default is to fetch one row at a time. +
+
+