HBASE-4931 [docs] CopyTable instructions could be improved (Misty Stanley-Jones)

Jonathan M Hsieh 2014-07-10 01:50:38 -07:00
parent 21d37b3a59
commit 95ef3acdd3
2 changed files with 88 additions and 75 deletions

@@ -159,9 +159,13 @@ public class CopyTable extends Configured implements Tool {
  System.err.println(" $ bin/hbase " +
  "org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 " +
  "--peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable ");
- System.err.println("For performance consider the following general options:\n"
- + "-Dhbase.client.scanner.caching=100\n"
- + "-Dmapreduce.map.speculative=false");
+ System.err.println("For performance consider the following general option:\n"
+ + " It is recommended that you set the following to >=100. A higher value uses more memory but\n"
+ + " decreases the round trip time to the server and may increase performance.\n"
+ + " -Dhbase.client.scanner.caching=100\n"
+ + " The following should always be set to false, to prevent writing data twice, which may produce \n"
+ + " inaccurate results.\n"
+ + " -Dmapreduce.map.speculative=false");
  }
  private static boolean doCommandLine(final String[] args) {

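Note on the usage text above: the two `-D` settings are general options, so per the `Usage:` line they go before the `--` options. A minimal sketch of how such an invocation might be assembled follows. It only prints the command rather than running it, and it reuses the example values from the usage text (start time, peer address, table name); the variable names are illustrative and not part of CopyTable. The Java usage text names `mapreduce.map.speculative`, while the documentation hunk names `mapred.map.tasks.speculative.execution`; the latter is the older Hadoop name for the same setting, and which one applies is assumed to depend on the Hadoop version in use.

```shell
#!/bin/sh
# Sketch only: builds and prints a CopyTable command line; nothing is run against a cluster.
# START_MS, the peer address, and the table name are the example values from the usage text.
START_MS=1265875194289
WINDOW_MS=$((60 * 60 * 1000))          # one hour, in milliseconds
END_MS=$((START_MS + WINDOW_MS))       # end of the one-hour window

# General (-D) options come first, then the CopyTable options, then the table name.
CMD="bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
-Dhbase.client.scanner.caching=100 \
-Dmapreduce.map.speculative=false \
--starttime=${START_MS} --endtime=${END_MS} \
--peer.adr=server1,server2,server3:2181:/hbase TestTable"

echo "$CMD"
```

The computed end time matches the `--endtime` value in the example, which is why the usage text describes it as a 1 hour window.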
@@ -188,20 +188,50 @@ private static final int ERROR_EXIT_CODE = 4;</programlisting>
  <section
  xml:id="driver">
  <title>Driver</title>
- <para>There is a <code>Driver</code> class that is executed by the HBase jar can be used to
- invoke frequently accessed utilities. For example,</para>
- <screen>HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar
-
- An example program must be given as the first argument.
- Valid program names are:
- completebulkload: Complete a bulk data load.
- copytable: Export a table from local cluster to peer cluster
- export: Write table data to HDFS.
- import: Import data written by Export.
- importtsv: Import data in TSV format.
- rowcounter: Count rows in HBase table
- verifyrep: Compare the data from tables in two different clusters. WARNING: It doesn't work for incrementColumnValues'd cells since the timestamp is chan
+ <para>Several frequently-accessed utilities are provided as <code>Driver</code> classes, and executed by
+ the <filename>bin/hbase</filename> command. These utilities represent MapReduce jobs which
+ run on your cluster. They are run in the following way, replacing
+ <replaceable>UtilityName</replaceable> with the utility you want to run. This command
+ assumes you have set the environment variable <literal>HBASE_HOME</literal> to the directory
+ where HBase is unpacked on your server.</para>
+ <screen>
+ ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.mapreduce.<replaceable>UtilityName</replaceable>
  </screen>
+ <para>The following utilities are available:</para>
+ <variablelist>
+ <varlistentry>
+ <term><command>LoadIncrementalHFiles</command></term>
+ <listitem><para>Complete a bulk data load.</para></listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><command>CopyTable</command></term>
+ <listitem><para>Export a table from the local cluster to a peer cluster.</para></listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><command>Export</command></term>
+ <listitem><para>Write table data to HDFS.</para></listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><command>Import</command></term>
+ <listitem><para>Import data written by a previous <command>Export</command> operation.</para></listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><command>ImportTsv</command></term>
+ <listitem><para>Import data in TSV format.</para></listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><command>RowCounter</command></term>
+ <listitem><para>Count rows in an HBase table.</para></listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><command>replication.VerifyReplication</command></term>
+ <listitem><para>Compare the data from tables in two different clusters. WARNING: It
+ doesn't work for incrementColumnValues'd cells since the timestamp is changed. Note that
+ this command is in a different package than the others.</para></listitem>
+ </varlistentry>
+ </variablelist>
+ <para>Each command except <command>RowCounter</command> accepts a single
+ <literal>--help</literal> argument to print usage instructions.</para>
  </section>
  <section
  xml:id="hbck">
@@ -266,66 +296,45 @@ Valid program names are:
  <para> CopyTable is a utility that can copy part or of all of a table, either to the same
  cluster or another cluster. The target table must first exist. The usage is as
  follows:</para>
- <screen>$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] tablename
- </screen>
- <variablelist>
- <title>Options</title>
- <varlistentry>
- <term>starttime</term>
- <listitem>
- <para>Beginning of the time range. Without endtime means starttime to forever.</para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>endtime</term>
- <listitem>
- <para>End of the time range. Without endtime means starttime to forever.</para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>versions</term>
- <listitem>
- <para>Number of cell versions to copy.</para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>new.name</term>
- <listitem>
- <para>New table's name.</para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>peer.adr</term>
- <listitem>
- <para>Address of the peer cluster given in the format
- hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent</para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>families</term>
- <listitem>
- <para>Comma-separated list of ColumnFamilies to copy.</para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>all.cells</term>
- <listitem>
- <para>Also copy delete markers and uncollected deleted cells (advanced option).</para>
- </listitem>
- </varlistentry>
- </variablelist>
- <itemizedlist>
- <title>Args:</title>
- <listitem>
- <para>tablename Name of table to copy.</para>
- </listitem>
- </itemizedlist>
- <para>Example of copying 'TestTable' to a cluster that uses replication for a 1 hour
- window:</para>
- <screen>$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable
- --starttime=1265875194289 --endtime=1265878794289
- --peer.adr=server1,server2,server3:2181:/hbase TestTable</screen>
+ <screen>
+ $ <userinput>./bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --help </userinput>
+ /bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --help
+ Usage: CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] &lt;tablename&gt;
+
+ Options:
+ rs.class hbase.regionserver.class of the peer cluster,
+ specify if different from current cluster
+ rs.impl hbase.regionserver.impl of the peer cluster,
+ startrow the start row
+ stoprow the stop row
+ starttime beginning of the time range (unixtime in millis)
+ without endtime means from starttime to forever
+ endtime end of the time range. Ignored if no starttime specified.
+ versions number of cell versions to copy
+ new.name new table's name
+ peer.adr Address of the peer cluster given in the format
+ hbase.zookeeer.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
+ families comma-separated list of families to copy
+ To copy from cf1 to cf2, give sourceCfName:destCfName.
+ To keep the same name, just give "cfName"
+ all.cells also copy delete markers and deleted cells
+
+ Args:
+ tablename Name of the table to copy
+
+ Examples:
+ To copy 'TestTable' to a cluster that uses replication for a 1 hour window:
+ $ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable
+
+ For performance consider the following general options:
+ It is recommended that you set the following to >=100. A higher value uses more memory but
+ decreases the round trip time to the server and may increase performance.
+ -Dhbase.client.scanner.caching=100
+ The following should always be set to false, to prevent writing data twice, which may produce
+ inaccurate results.
+ -Dmapred.map.tasks.speculative.execution=false
+ </screen>
  <note>
  <title>Scanner Caching</title>
  <para>Caching for the input Scan is configured via <code>hbase.client.scanner.caching</code>