HBASE-3240 Improve documentation of importtsv and bulk loads.

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1143232 13f79535-47bb-0310-9956-ffa450edef68
Todd Lipcon 2011-07-06 00:09:55 +00:00
parent 1c151dcfac
commit 8b2948bd62
3 changed files with 77 additions and 51 deletions

@@ -296,6 +296,8 @@ Release 0.91.0 - Unreleased
 (Mingjie Lai via garyh)
 HBASE-4036 Implementing a MultipleColumnPrefixFilter (Anirudh Todi)
 HBASE-4048 [Coprocessors] Support configuration of coprocessor at load time
+HBASE-3240 Improve documentation of importtsv and bulk loads.
+ (Aaron T. Myers via todd)
 
 TASKS
 HBASE-3559 Move report of split to master OFF the heartbeat channel

@@ -254,11 +254,13 @@ public class ImportTsv {
 "column name is either a simple column family, or a columnfamily:qualifier. The special\n" +
 "column name HBASE_ROW_KEY is used to designate that this column should be used\n" +
 "as the row key for each imported record. You must specify exactly one column\n" +
-"to be the row key.\n" +
+"to be the row key, and you must specify a column name for every column that exists in the\n" +
+"input data.\n" +
 "\n" +
 "By default importtsv will load data directly into HBase. To instead generate\n" +
 "HFiles of data to prepare for a bulk data load, pass the option:\n" +
 "  -D" + BULK_OUTPUT_CONF_KEY + "=/path/for/output\n" +
+"  Note: if you do not use this option, then the target table must already exist in HBase\n" +
 "\n" +
 "Other options that may be specified with -D include:\n" +
 "  -D" + SKIP_LINES_CONF_KEY + "=false - fail if encountering an invalid line\n" +

@@ -34,8 +34,8 @@
 This document describes HBase's bulk load functionality. The bulk load
 feature uses a MapReduce job to output table data in HBase's internal
 data format, and then directly loads the data files into a running
-cluster. Using bulk load will use less CPU and network than going
-via the HBase API.
+cluster. Using bulk load will use less CPU and network resources than
+simply using the HBase API.
 </p>
 </section>
 <section name="Bulk Load Architecture">
@@ -50,43 +50,85 @@
 later loaded very efficiently into the cluster.
 </p>
 <p>
-In order to function efficiently, HFileOutputFormat must be configured
-such that each output HFile fits within a single region. In order to
-do this, jobs use Hadoop's TotalOrderPartitioner class to partition the
-map output into disjoint ranges of the key space, corresponding to the
-key ranges of the regions in the table.
+In order to function efficiently, HFileOutputFormat must be
+configured such that each output HFile fits within a single region.
+In order to do this, jobs whose output will be bulk loaded into HBase
+use Hadoop's TotalOrderPartitioner class to partition the map output
+into disjoint ranges of the key space, corresponding to the key
+ranges of the regions in the table.
 </p>
 <p>
-HFileOutputFormat includes a convenience function, <code>configureIncrementalLoad()</code>,
-which automatically sets up a TotalOrderPartitioner based on the current
-region boundaries of a table.
+HFileOutputFormat includes a convenience function,
+<code>configureIncrementalLoad()</code>, which automatically sets up
+a TotalOrderPartitioner based on the current region boundaries of a
+table.
 </p>
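
To make the convenience function concrete, here is a minimal sketch of a preparation job wired up with configureIncrementalLoad(). The mapper, column layout, table name, and paths are hypothetical; the HTable-based signature shown is the one this generation of HFileOutputFormat exposes.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadPrepare {

  /** Hypothetical mapper: turns "rowkey<TAB>value" lines into Puts. */
  static class LineMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(parts[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("d"), Bytes.toBytes("c1"), Bytes.toBytes(parts[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable"); // region boundaries read from here
    Job job = new Job(conf, "prepare-hfiles");
    job.setJarByClass(BulkLoadPrepare.class);
    job.setMapperClass(LineMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path("/user/todd/input"));
    FileOutputFormat.setOutputPath(job, new Path("/user/todd/myoutput"));
    // Sets the output format, a TotalOrderPartitioner whose partition file
    // mirrors the table's current region boundaries, and a sorting reducer.
    HFileOutputFormat.configureIncrementalLoad(job, table);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}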
 </section>
 <section name="Completing the data load">
 <p>
-After the data has been prepared using <code>HFileOutputFormat</code>, it
-is loaded into the cluster using a command line tool. This command line tool
-iterates through the prepared data files, and for each one determines the
-region the file belongs to. It then contacts the appropriate Region Server
-which adopts the HFile, moving it into its storage directory and making
-the data available to clients.
+After the data has been prepared using
+<code>HFileOutputFormat</code>, it is loaded into the cluster using
+<code>completebulkload</code>. This command line tool iterates
+through the prepared data files, and for each one determines the
+region the file belongs to. It then contacts the appropriate Region
+Server which adopts the HFile, moving it into its storage directory
+and making the data available to clients.
 </p>
 <p>
 If the region boundaries have changed during the course of bulk load
-preparation, or between the preparation and completion steps, the bulk
-load commandline utility will automatically split the data files into
-pieces corresponding to the new boundaries. This process is not
-optimally efficient, so users should take care to minimize the delay between
-preparing a bulk load and importing it into the cluster, especially
-if other clients are simultaneously loading data through other means.
+preparation, or between the preparation and completion steps, the
+<code>completebulkload</code> utility will automatically split the
+data files into pieces corresponding to the new boundaries. This
+process is not optimally efficient, so users should take care to
+minimize the delay between preparing a bulk load and importing it
+into the cluster, especially if other clients are simultaneously
+loading data through other means.
 </p>
 </section>
 </section>
-<section name="Preparing a bulk load using the importtsv tool">
+<section name="Importing the prepared data using the completebulkload tool">
 <p>
-HBase ships with a command line tool called <code>importtsv</code>. This tool
-is available by running <code>hadoop jar /path/to/hbase-VERSION.jar importtsv</code>.
-Running this tool with no arguments prints brief usage information:
+After a data import has been prepared, either by using the
+<code>importtsv</code> tool with the
+"<code>importtsv.bulk.output</code>" option or by some other MapReduce
+job using the <code>HFileOutputFormat</code>, the
+<code>completebulkload</code> tool is used to import the data into the
+running cluster.
+</p>
+<p>
+The <code>completebulkload</code> tool simply takes the output path
+where <code>importtsv</code> or your MapReduce job put its results, and
+the table name to import into. For example:
+</p>
+<code>$ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable</code>
+<p>
+The <code>-c config-file</code> option can be used to specify a file
+containing the appropriate hbase parameters (e.g., hbase-site.xml) if
+not supplied already on the CLASSPATH (In addition, the CLASSPATH must
+contain the directory that has the zookeeper configuration file if
+zookeeper is NOT managed by HBase).
+</p>
+<p>
+<b>Note:</b> If the target table does not already exist in HBase, this
+tool will create the table automatically.</p>
+<p>
+This tool will run quickly, after which point the new data will be visible in
+the cluster.
+</p>
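
For readers who prefer the API, a hedged sketch of the same completion step via LoadIncrementalHFiles, the class behind <code>completebulkload</code> (table name and output path are the same illustrative values used above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class CompleteBulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");
    // Walks the prepared output directory and hands each HFile to the
    // region server owning its key range, splitting files if needed.
    new LoadIncrementalHFiles(conf).doBulkLoad(
        new Path("/user/todd/myoutput"), table);
  }
}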
+</section>
+<section name="Using the importtsv tool to bulk load data">
+<p>
+HBase ships with a command line tool called <code>importtsv</code>
+which when given files containing data in TSV form can prepare this
+data for bulk import into HBase. This tool by default uses the HBase
+<code>put</code> API to insert data into HBase one row at a time, but
+when the "<code>importtsv.bulk.output</code>" option is used,
+<code>importtsv</code> will instead generate files using
+<code>HFileOutputFormat</code> which can subsequently be bulk-loaded
+into HBase using the <code>completebulkload</code> tool described
+above. This tool is available by running "<code>hadoop jar
+/path/to/hbase-VERSION.jar importtsv</code>". Running this command
+with no arguments prints brief usage information:
 </p>
 <code><pre>
 Usage: importtsv -Dimporttsv.columns=a,b,c &lt;tablename&gt; &lt;inputdir&gt;
@@ -98,41 +140,21 @@ option. This option takes the form of comma-separated column names, where each
 column name is either a simple column family, or a columnfamily:qualifier. The special
 column name HBASE_ROW_KEY is used to designate that this column should be used
 as the row key for each imported record. You must specify exactly one column
-to be the row key.
+to be the row key, and you must specify a column name for every column that exists in the
+input data.
 
 By default importtsv will load data directly into HBase. To instead generate
 HFiles of data to prepare for a bulk data load, pass the option:
   -Dimporttsv.bulk.output=/path/for/output
+  Note: if you do not use this option, then the target table must already exist in HBase
 
 Other options that may be specified with -D include:
   -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
-  -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
   '-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
-  -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of TsvImporterMapper
+  -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
+  -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
 </pre></code>
 </section>
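
For instance, assuming a hypothetical table mytable with column family d, TSV input could be prepared and then loaded with a pair of commands in the style shown earlier:

<code>$ hadoop jar /path/to/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 -Dimporttsv.bulk.output=/user/todd/myoutput mytable /user/todd/input</code>

<code>$ hadoop jar /path/to/hbase-VERSION.jar completebulkload /user/todd/myoutput mytable</code>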
-<section name="Importing the prepared data using the completebulkload tool">
-<p>
-After a data import has been prepared using the <code>importtsv</code> tool, the
-<code>completebulkload</code> tool is used to import the data into the running cluster.
-</p>
-<p>
-The <code>completebulkload</code> tool simply takes the same output path where
-<code>importtsv</code> put its results, and the table name. For example:
-</p>
-<code>$ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable</code>
-<p>The <code>-c config-file</code> option can be used to specify a file containing the
-appropriate hbase parameters (e.g., hbase-site.xml) if not supplied already on
-the CLASSPATH (In addition, the CLASSPATH must contain the directory that has
-the zookeeper configuration file if zookeeper is NOT managed by HBase).
-</p>
-<p>
-This tool will run quickly, after which point the new data will be visible in
-the cluster.
-</p>
-</section>
 <section name="Advanced Usage">
 <p>
 Although the <code>importtsv</code> tool is useful in many cases, advanced users may