HBASE-4931 [docs] CopyTable instructions could be improved (Misty Stanley-Jones)

2014-07-10 01:50:38 -07:00 · 2014-07-10 01:50:38 -07:00 · 95ef3acdd3
parent 21d37b3a59
commit 95ef3acdd3
2 changed files with 88 additions and 75 deletions
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/CopyTable.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/CopyTable.java
@@ -159,9 +159,13 @@ public class CopyTable extends Configured implements Tool {
    System.err.println(" $ bin/hbase " +
        "org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 " +
        "--peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable ");
-    System.err.println("For performance consider the following general options:\n"
-        + "-Dhbase.client.scanner.caching=100\n"
-        + "-Dmapreduce.map.speculative=false");
+    System.err.println("For performance consider the following general option:\n"
+        + "  It is recommended that you set the following to >=100. A higher value uses more memory but\n"
+        + "  decreases the round trip time to the server and may increase performance.\n"
+        + "    -Dhbase.client.scanner.caching=100\n"
+        + "  The following should always be set to false, to prevent writing data twice, which may produce \n"
+        + "  inaccurate results.\n"
+        + "    -Dmapreduce.map.speculative=false");
  }
  private static boolean doCommandLine(final String[] args) {
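As a rough illustration of the example invocation and the two general options in the usage text above, the following dry-run sketch assembles a CopyTable command line and prints it instead of running it. The variable names (`START_MS`, `WINDOW_MS`, `END_MS`, `PEER_ADR`, `TABLE`, `CMD`) are illustrative only, the hosts are the placeholder values from the usage text, and the speculative-execution property is spelled as in `CopyTable.java` (the DocBook text in this commit uses the older `mapred.map.tasks.speculative.execution` name).

```shell
#!/bin/sh
# Dry-run sketch: build the CopyTable command line and print it.
# Nothing is executed against a cluster; all values are placeholders
# taken from the usage text, not real hosts or tables.

START_MS=1265875194289             # start of the window, Unix time in milliseconds
WINDOW_MS=$((60 * 60 * 1000))      # one hour expressed in milliseconds
END_MS=$((START_MS + WINDOW_MS))   # end of the window

# Format: zookeeper quorum : client port : znode parent
PEER_ADR="server1,server2,server3:2181:/hbase"
TABLE="TestTable"

# General (-D) options come before the tool-specific (--) options,
# matching "Usage: CopyTable [general options] ..." in the help text.
CMD="bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
-Dhbase.client.scanner.caching=100 \
-Dmapreduce.map.speculative=false \
--starttime=${START_MS} --endtime=${END_MS} \
--peer.adr=${PEER_ADR} ${TABLE}"

echo "$CMD"
```

Printing the command first makes it easy to confirm the time window and peer address before launching the actual MapReduce job.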
--- a/src/main/docbkx/ops_mgt.xml
+++ b/src/main/docbkx/ops_mgt.xml
@@ -188,20 +188,50 @@ private static final int ERROR_EXIT_CODE = 4;</programlisting>
    <section
      xml:id="driver">
      <title>Driver</title>
-      <para>There is a <code>Driver</code> class that is executed by the HBase jar can be used to
-        invoke frequently accessed utilities. For example,</para>
-      <screen>HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar
-
-An example program must be given as the first argument.
-Valid program names are:
-  completebulkload: Complete a bulk data load.
-  copytable: Export a table from local cluster to peer cluster
-  export: Write table data to HDFS.
-  import: Import data written by Export.
-  importtsv: Import data in TSV format.
-  rowcounter: Count rows in HBase table
-  verifyrep: Compare the data from tables in two different clusters. WARNING: It doesn't work for incrementColumnValues'd cells since the timestamp is chan
+      <para>Several frequently-accessed utilities are provided as <code>Driver</code> classes, and executed by
+        the <filename>bin/hbase</filename> command. These utilities represent MapReduce jobs which
+        run on your cluster. They are run in the following way, replacing
+          <replaceable>UtilityName</replaceable> with the utility you want to run. This command
+        assumes you have set the environment variable <literal>HBASE_HOME</literal> to the directory
+        where HBase is unpacked on your server.</para>
+      <screen>
+${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.mapreduce.<replaceable>UtilityName</replaceable>
      </screen>
+      <para>The following utilities are available:</para>
+      <variablelist>
+        <varlistentry>
+          <term><command>LoadIncrementalHFiles</command></term>
+          <listitem><para>Complete a bulk data load.</para></listitem>
+        </varlistentry>
+        <varlistentry>
+          <term><command>CopyTable</command></term>
+          <listitem><para>Export a table from the local cluster to a peer cluster.</para></listitem>
+        </varlistentry>
+        <varlistentry>
+          <term><command>Export</command></term>
+          <listitem><para>Write table data to HDFS.</para></listitem>
+        </varlistentry>
+        <varlistentry>
+          <term><command>Import</command></term>
+          <listitem><para>Import data written by a previous <command>Export</command> operation.</para></listitem>
+        </varlistentry>
+        <varlistentry>
+          <term><command>ImportTsv</command></term>
+          <listitem><para>Import data in TSV format.</para></listitem>
+        </varlistentry>
+        <varlistentry>
+          <term><command>RowCounter</command></term>
+          <listitem><para>Count rows in an HBase table.</para></listitem>
+        </varlistentry>
+        <varlistentry>
+          <term><command>replication.VerifyReplication</command></term>
+          <listitem><para>Compare the data from tables in two different clusters. WARNING: It
+            doesn't work for incrementColumnValues'd cells since the timestamp is changed. Note that
+          this command is in a different package than the others.</para></listitem>
+        </varlistentry>
+      </variablelist>
+      <para>Each command except <command>RowCounter</command> accepts a single
+        <literal>--help</literal> argument to print usage instructions.</para>
    </section>
    <section
      xml:id="hbck">
@@ -266,66 +296,45 @@ Valid program names are:
      <para> CopyTable is a utility that can copy part or of all of a table, either to the same
        cluster or another cluster. The target table must first exist. The usage is as
        follows:</para>
-      <screen>$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] tablename
- </screen>
-      <variablelist>
-        <title>Options</title>
-        <varlistentry>
-          <term>starttime</term>
-          <listitem>
-            <para>Beginning of the time range. Without endtime means starttime to forever.</para>
-          </listitem>
-        </varlistentry>
-        <varlistentry>
-          <term>endtime</term>
-          <listitem>
-            <para>End of the time range. Without endtime means starttime to forever.</para>
-          </listitem>
-        </varlistentry>
-        <varlistentry>
-          <term>versions</term>
-          <listitem>
-            <para>Number of cell versions to copy.</para>
-          </listitem>
-        </varlistentry>
-        <varlistentry>
-          <term>new.name</term>
-          <listitem>
-            <para>New table's name.</para>
-          </listitem>
-        </varlistentry>
-        <varlistentry>
-          <term>peer.adr</term>
-          <listitem>
-            <para>Address of the peer cluster given in the format
-              hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent</para>
-          </listitem>
-        </varlistentry>
-        <varlistentry>
-          <term>families</term>
-          <listitem>
-            <para>Comma-separated list of ColumnFamilies to copy.</para>
-          </listitem>
-        </varlistentry>
-        <varlistentry>
-          <term>all.cells</term>
-          <listitem>
-            <para>Also copy delete markers and uncollected deleted cells (advanced option).</para>
-          </listitem>
-        </varlistentry>
-      </variablelist>
-      <itemizedlist>
-        <title>Args:</title>
-        <listitem>
-          <para>tablename Name of table to copy.</para>
-        </listitem>
-      </itemizedlist>
-      <para>Example of copying 'TestTable' to a cluster that uses replication for a 1 hour
-        window:</para>
-      <screen>$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable
- --starttime=1265875194289 --endtime=1265878794289
- --peer.adr=server1,server2,server3:2181:/hbase TestTable</screen>
+      <screen>
+$ <userinput>./bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --help </userinput>
+/bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --help
+Usage: CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] &lt;tablename&gt;
+
+Options:
+ rs.class     hbase.regionserver.class of the peer cluster,
+              specify if different from current cluster
+ rs.impl      hbase.regionserver.impl of the peer cluster,
+ startrow     the start row
+ stoprow      the stop row
+ starttime    beginning of the time range (unixtime in millis)
+              without endtime means from starttime to forever
+ endtime      end of the time range.  Ignored if no starttime specified.
+ versions     number of cell versions to copy
+ new.name     new table's name
+ peer.adr     Address of the peer cluster given in the format
+              hbase.zookeeer.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
+ families     comma-separated list of families to copy
+              To copy from cf1 to cf2, give sourceCfName:destCfName.
+              To keep the same name, just give "cfName"
+ all.cells    also copy delete markers and deleted cells
+
+Args:
+ tablename    Name of the table to copy
+
+Examples:
+ To copy 'TestTable' to a cluster that uses replication for a 1 hour window:
+ $ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable
+
+For performance consider the following general options:
+  It is recommended that you set the following to >=100. A higher value uses more memory but
+  decreases the round trip time to the server and may increase performance.
+    -Dhbase.client.scanner.caching=100
+  The following should always be set to false, to prevent writing data twice, which may produce
+  inaccurate results.
+    -Dmapred.map.tasks.speculative.execution=false
+      </screen>
      <note>
        <title>Scanner Caching</title>
        <para>Caching for the input Scan is configured via <code>hbase.client.scanner.caching</code>