From 7997c5187fac0968435cb0bd19553f56cc7e00c3 Mon Sep 17 00:00:00 2001
From: Wellington Chevreuil
Date: Wed, 7 Nov 2018 09:30:24 +0000
Subject: [PATCH] HBASE-15557 Add guidance on HashTable/SyncTable to the RefGuide

Signed-off-by: Sean Busbey
---
 src/main/asciidoc/_chapters/ops_mgt.adoc | 118 +++++++++++++++++++++++
 1 file changed, 118 insertions(+)

diff --git a/src/main/asciidoc/_chapters/ops_mgt.adoc b/src/main/asciidoc/_chapters/ops_mgt.adoc
index ae5507f4ce5..200c380051d 100644
--- a/src/main/asciidoc/_chapters/ops_mgt.adoc
+++ b/src/main/asciidoc/_chapters/ops_mgt.adoc
@@ -531,6 +531,124 @@ By default, CopyTable utility only copies the latest version of row cells unless
 See Jonathan Hsieh's link:https://blog.cloudera.com/blog/2012/06/online-hbase-backups-with-copytable-2/[Online HBase Backups with CopyTable] blog post for more on `CopyTable`.
 
+[[hashtable.synctable]]
+=== HashTable/SyncTable
+
+HashTable/SyncTable is a two-step tool for synchronizing table data, where each step is implemented as a MapReduce job.
+Like CopyTable, it can be used for partial or entire table data syncing, within the same cluster or between remote clusters.
+However, it performs the sync more efficiently than CopyTable. Instead of copying all cells
+in the specified row key/time period range, HashTable (the first step) creates hashed indexes for batches of cells on the source table and outputs those as results.
+On the next stage, SyncTable runs on the target cluster: it calculates hash indexes for the table cells on the target,
+compares these hashes with the outputs of HashTable, then scans (and compares) only the cells in diverging hash batches, updating just the
+mismatching cells. This results in less network traffic/data transfer, which can make a real difference when syncing large tables between remote clusters.
+
+==== Step 1, HashTable
+
+First, run HashTable on the source cluster (the cluster hosting the table whose state will be copied to its counterpart).
+
+Usage:
+
+----
+$ ./bin/hbase org.apache.hadoop.hbase.mapreduce.HashTable --help
+Usage: HashTable [options] <tablename> <outputpath>
+
+Options:
+ batchsize     the target amount of bytes to hash in each batch
+               rows are added to the batch until this size is reached
+               (defaults to 8000 bytes)
+ numhashfiles  the number of hash files to create
+               if set to fewer than number of regions then
+               the job will create this number of reducers
+               (defaults to 1/100 of regions -- at least 1)
+ startrow      the start row
+ stoprow       the stop row
+ starttime     beginning of the time range (unixtime in millis)
+               without endtime means from starttime to forever
+ endtime       end of the time range. Ignored if no starttime specified.
+ scanbatch     scanner batch size to support intra row scans
+ versions      number of cell versions to include
+ families      comma-separated list of families to include
+
+Args:
+ tablename     Name of the table to hash
+ outputpath    Filesystem path to put the output data
+
+Examples:
+ To hash 'TestTable' in 32kB batches for a 1 hour window into 50 files:
+ $ bin/hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=32000 --numhashfiles=50 --starttime=1265875194289 --endtime=1265878794289 --families=cf2,cf3 TestTable /hashes/testTable
+----
+
+The *batchsize* property defines how much cell data for a given region will be hashed together in a single hash value.
+Sizing it properly has a direct impact on sync efficiency, as it may lead to fewer scans executed by the mapper tasks
+of SyncTable (the next step in the process).
+The rule of thumb is that the fewer cells expected to be out of sync (that is, the lower the probability of finding a diff),
+the larger the batch size value can be.
+
+==== Step 2, SyncTable
+
+Once HashTable has completed on the source cluster, SyncTable can be run on the target cluster.
+Just like replication and other synchronization jobs, it requires that all RegionServers/DataNodes
+on the source cluster be accessible from the NodeManagers on the target cluster (where the SyncTable job tasks will be running).
+
+Usage:
+
+----
+$ ./bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --help
+Usage: SyncTable [options] <sourcehashdir> <sourcetable> <targettable>
+
+Options:
+ sourcezkcluster  ZK cluster key of the source table
+                  (defaults to cluster in classpath's config)
+ targetzkcluster  ZK cluster key of the target table
+                  (defaults to cluster in classpath's config)
+ dryrun           if true, output counters but no writes
+                  (defaults to false)
+ doDeletes        if false, does not perform deletes
+                  (defaults to true)
+ doPuts           if false, does not perform puts
+                  (defaults to true)
+
+Args:
+ sourcehashdir    path to HashTable output dir for source table
+                  (see org.apache.hadoop.hbase.mapreduce.HashTable)
+ sourcetable      Name of the source table to sync from
+ targettable      Name of the target table to sync to
+
+Examples:
+ For a dry run SyncTable of tableA from a remote source cluster
+ to a local target cluster:
+ $ bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=true --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://nn:9000/hashes/tableA tableA tableA
+----
+
+The *dryrun* option is useful when a read-only diff report is wanted, as it will only produce COUNTERS indicating the differences, but will not perform
+any actual changes. It can be used as an alternative to the VerifyReplication tool.
+
+By default, SyncTable will cause the target table to become an exact copy of the source table (at least, for the specified startrow/stoprow and/or starttime/endtime range).
+
+Setting doDeletes to false modifies the default behaviour so that target cells missing on the source are not deleted.
+Similarly, setting doPuts to false modifies the default behaviour so that cells missing on the target are not added. Setting both doDeletes
+and doPuts to false gives the same effect as setting dryrun to true.
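+
+For example, a SyncTable run that only applies missing puts on the target and never deletes target cells could look like the following sketch
+(the ZooKeeper quorum, NameNode address and paths are placeholders reused from the example above):
+
+----
+$ bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --doDeletes=false --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://nn:9000/hashes/tableA tableA tableA
+----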
+
+.Set doDeletes to false on Two-Way Replication scenarios
+[NOTE]
+====
+On Two-Way Replication or other scenarios where both source and target clusters can have data ingested, it's advisable to always set the doDeletes option to false,
+as any additional cell inserted on the SyncTable target cluster that has not yet been replicated to the source would be deleted, and potentially lost permanently.
+====
+
+.Set sourcezkcluster to the actual source cluster ZK quorum
+[NOTE]
+====
+Although not a required option, if sourcezkcluster is not set, SyncTable will connect to the local HBase cluster for both source and target,
+which does not give any meaningful result.
+====
+
+.Remote Clusters on different Kerberos Realms
+[NOTE]
+====
+Currently, SyncTable can't be run against remote clusters on different Kerberos realms.
+There is work in progress to resolve this in link:https://jira.apache.org/jira/browse/HBASE-20586[HBASE-20586]
+====
+
 [[export]]
 === Export