hbase-6078. ported and refactored bulk-loading information into RefGuide
git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1342073 13f79535-47bb-0310-9956-ffa450edef68
parent be6e4c6593
commit 1f242db7f7
@@ -2372,7 +2372,118 @@ myHtd.setValue(HTableDescriptor.SPLIT_POLICY, MyCustomSplitPolicy.class.getName(
</section>

</section> <!-- bloom -->

</section> <!-- regions -->

<section xml:id="arch.bulk.load"><title>Bulk Loading</title>
<section xml:id="arch.bulk.load.overview"><title>Overview</title>
<para>
HBase includes several methods of loading data into tables.
The most straightforward method is to either use the <code>TableOutputFormat</code>
class from a MapReduce job, or use the normal client APIs; however,
these are not always the most efficient methods.
</para>
<para>
The bulk load feature uses a MapReduce job to output table data in HBase's internal
data format, and then directly loads the generated StoreFiles into a running
cluster. Using bulk load consumes less CPU and network resources than
simply using the HBase API.
</para>
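<para>
For comparison, the following is a minimal sketch of writing a single row through the
normal client API (the 0.94-era Java client); the table name, column family, and values
are placeholders used only for illustration:
</para>
<programlisting>
// Illustrative sketch only: one row written through the client API, as opposed to bulk load.
// The table name, column family, qualifier, and values are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientApiPutExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");   // assumed table name
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("q1"), Bytes.toBytes("value1"));
    table.put(put);                               // each Put is a write through a RegionServer
    table.close();
  }
}
</programlisting>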
</section>
<section xml:id="arch.bulk.load.arch"><title>Bulk Load Architecture</title>
<para>
The HBase bulk load process consists of two main steps.
</para>
<section xml:id="arch.bulk.load.prep"><title>Preparing data via a MapReduce job</title>
<para>
The first step of a bulk load is to generate HBase data files (StoreFiles) from
a MapReduce job using <code>HFileOutputFormat</code>. This output format writes
out data in HBase's internal storage format so that the files can
later be loaded very efficiently into the cluster.
</para>
<para>
To function efficiently, <code>HFileOutputFormat</code> must be
configured such that each output HFile fits within a single region.
To do this, jobs whose output will be bulk loaded into HBase
use Hadoop's <code>TotalOrderPartitioner</code> class to partition the map output
into disjoint ranges of the key space, corresponding to the key
ranges of the regions in the table.
</para>
<para>
<code>HFileOutputFormat</code> includes a convenience function,
<code>configureIncrementalLoad()</code>, which automatically sets up
a <code>TotalOrderPartitioner</code> based on the current region boundaries of a
table.
</para>
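<para>
A sketch of how a preparation job might be wired together is shown below. It follows the
MapReduce APIs of this era; the mapper, the input and output paths, the table name, and the
line format are assumptions made only for illustration:
</para>
<programlisting>
// Illustrative sketch of a bulk load preparation job (0.94-era APIs).
// Input path, output path, table name, and input line format are assumptions.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadPrepareJob {

  // Hypothetical mapper: parses lines of the form "rowkey,value" and emits Puts keyed by row.
  static class PrepareMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");
      byte[] row = Bytes.toBytes(fields[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("q1"), Bytes.toBytes(fields[1]));
      context.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "bulk-load-prepare");
    job.setJarByClass(BulkLoadPrepareJob.class);
    job.setMapperClass(PrepareMapper.class);
    // The map output types must be set before configureIncrementalLoad(), which inspects them.
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path("/user/example/input"));          // assumed input
    FileOutputFormat.setOutputPath(job, new Path("/user/example/hfile-output")); // assumed output

    // configureIncrementalLoad() looks at the table's current region boundaries and sets up
    // the TotalOrderPartitioner, the appropriate reducer, and HFileOutputFormat accordingly.
    HTable table = new HTable(conf, "mytable");   // assumed table name
    HFileOutputFormat.configureIncrementalLoad(job, table);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
</programlisting>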
</section>
<section xml:id="arch.bulk.load.complete"><title>Completing the data load</title>
<para>
After the data has been prepared using
<code>HFileOutputFormat</code>, it is loaded into the cluster using
<code>completebulkload</code>. This command line tool iterates
through the prepared data files, and for each one determines the
region the file belongs to. It then contacts the appropriate
RegionServer, which adopts the HFile, moving it into its storage directory
and making the data available to clients.
</para>
<para>
If the region boundaries have changed during the course of bulk load
preparation, or between the preparation and completion steps, the
<code>completebulkload</code> utility will automatically split the
data files into pieces corresponding to the new boundaries. This
process is not optimally efficient, so users should take care to
minimize the delay between preparing a bulk load and importing it
into the cluster, especially if other clients are simultaneously
loading data through other means.
</para>
</section>
</section>
<section xml:id="arch.bulk.load.import"><title>Importing the prepared data using the completebulkload tool</title>
<para>
After a data import has been prepared, either by using the
<code>importtsv</code> tool with the
"<code>importtsv.bulk.output</code>" option or by some other MapReduce
job using the <code>HFileOutputFormat</code>, the
<code>completebulkload</code> tool is used to import the data into the
running cluster.
</para>
<para>
The <code>completebulkload</code> tool simply takes the output path
where <code>importtsv</code> or your MapReduce job put its results, and
the table name to import into. For example:
</para>
<code>$ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable</code>
<para>
The <code>-c config-file</code> option can be used to specify a file
containing the appropriate HBase parameters (e.g., hbase-site.xml) if
it is not already supplied on the CLASSPATH. (In addition, the CLASSPATH must
contain the directory that has the ZooKeeper configuration file if
ZooKeeper is NOT managed by HBase.)
</para>
<para>
Note: If the target table does not already exist in HBase, this
tool will create the table automatically.</para>
<para>
This tool will run quickly, after which point the new data will be visible in
the cluster.
</para>
</section>
<section xml:id="arch.bulk.load.also"><title>See Also</title>
<para>For more information about the referenced utilities, see <xref linkend="importtsv"/> and <xref linkend="completebulkload"/>.
</para>
</section>
<section xml:id="arch.bulk.load.adv"><title>Advanced Usage</title>
<para>
Although the <code>importtsv</code> tool is useful in many cases, advanced users may
want to generate data programmatically, or import data from other formats. To get
started doing so, dig into <code>ImportTsv.java</code> and check the JavaDoc for
<code>HFileOutputFormat</code>.
</para>
<para>
The import step of the bulk load can also be done programmatically. See the
<code>LoadIncrementalHFiles</code> class for more information.
</para>
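<para>
The following is a minimal sketch of the programmatic import, assuming the StoreFiles were
produced by a job such as the one in <xref linkend="arch.bulk.load.prep"/>; the path and
table name are placeholders:
</para>
<programlisting>
// Illustrative sketch: programmatic equivalent of the completebulkload tool (0.94-era APIs).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadCompleteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");                    // assumed table name
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    // Point the loader at the directory produced by HFileOutputFormat (placeholder path).
    loader.doBulkLoad(new Path("/user/example/hfile-output"), table);
    table.close();
  }
}
</programlisting>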
</section>
</section> <!-- bulk loading -->

<section xml:id="arch.hdfs"><title>HDFS</title>
<para>As HBase runs on HDFS (and each StoreFile is written as a file on HDFS),
@@ -36,6 +36,25 @@
<para>Here we list HBase tools for administration, analysis, fixup, and
debugging.</para>
<section xml:id="driver"><title>Driver</title>
<para>There is a <code>Driver</code> class, executed by the HBase jar, that can be used to invoke frequently accessed utilities. For example,
<programlisting>HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar
</programlisting>
... will return...
<programlisting>
An example program must be given as the first argument.
Valid program names are:
completebulkload: Complete a bulk data load.
copytable: Export a table from local cluster to peer cluster
export: Write table data to HDFS.
import: Import data written by Export.
importtsv: Import data in TSV format.
rowcounter: Count rows in HBase table
verifyrep: Compare the data from tables in two different clusters. WARNING: It doesn't work for incrementColumnValues'd cells since the timestamp is changed
</programlisting>
... for allowable program names.
</para>
</section>
<section xml:id="hbck">
<title>HBase <application>hbck</application></title>
<subtitle>An <emphasis>fsck</emphasis> for your HBase install</subtitle>
@@ -133,15 +152,92 @@
</section>
<section xml:id="importtsv">
<title>ImportTsv</title>
<para>ImportTsv is a utility that will load data in TSV format into HBase. It has two distinct usages: loading data from TSV format in HDFS
into HBase via Puts, and preparing StoreFiles to be loaded via the <code>completebulkload</code> tool.
</para>
<para>To load data via Puts (i.e., non-bulk loading):
<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <hdfs-inputdir>
</programlisting>
</para>
<para>To generate StoreFiles for bulk-loading:
<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c -Dimporttsv.bulk.output=hdfs://storefile-outputdir <tablename> <hdfs-data-inputdir>
</programlisting>
</para>
<para>These generated StoreFiles can be loaded into HBase via <xref linkend="completebulkload"/>.
</para>
<section xml:id="importtsv.options"><title>ImportTsv Options</title>
<para>Running ImportTsv with no arguments prints brief usage information:</para>
<programlisting>
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>

Imports the given input directory of TSV data into the specified table.

The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family, or a columnfamily:qualifier. The special
column name HBASE_ROW_KEY is used to designate that this column should be used
as the row key for each imported record. You must specify exactly one column
to be the row key, and you must specify a column name for every column that exists in the
input data.

By default importtsv will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:
-Dimporttsv.bulk.output=/path/for/output
Note: if you do not use this option, then the target table must already exist in HBase

Other options that may be specified with -D include:
-Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
'-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
-Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
-Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
</programlisting>
</section>
<section xml:id="importtsv.example"><title>ImportTsv Example</title>
<para>For example, assume that we are loading data into a table called 'datatsv' with a ColumnFamily called 'd' and two columns "c1" and "c2".
</para>
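<para>
If the target table does not already exist (it must exist when loading via Puts), it can be
created from the Java client first; the following is a minimal sketch, assuming default
table settings:
</para>
<programlisting>
// Illustrative sketch: create the 'datatsv' table with column family 'd' before the import.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateDataTsvTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("datatsv");
    desc.addFamily(new HColumnDescriptor("d"));
    admin.createTable(desc);
  }
}
</programlisting>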
<para>Assume that an input file exists as follows:
<programlisting>
row1 c1 c2
row2 c1 c2
row3 c1 c2
row4 c1 c2
row5 c1 c2
row6 c1 c2
row7 c1 c2
row8 c1 c2
row9 c1 c2
row10 c1 c2
</programlisting>
</para>
<para>For ImportTsv to use this input file, the command line needs to look like this:
<programlisting>
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile
</programlisting>
... and in this example the first column is the rowkey, which is why the HBASE_ROW_KEY is used. The second and third columns in the file will be imported as "d:c1" and "d:c2", respectively.
</para>
</section>
<section xml:id="importtsv.warning"><title>ImportTsv Warning</title>
<para>If you are preparing a lot of data for bulk loading, make sure the target HBase table is pre-split appropriately.
</para>
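<para>
One way to do this is to supply split points when the table is created. The following
sketch is illustrative only; the table name, column family, and split points are
placeholders and should be chosen to match the row key distribution of the data being
loaded:
</para>
<programlisting>
// Illustrative sketch: create a pre-split table so a large bulk load spreads across regions.
// Table name, family, and split points are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePresplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("mytable");
    desc.addFamily(new HColumnDescriptor("d"));
    // Nine split points yield ten initial regions; pick keys that match your data.
    byte[][] splitPoints = new byte[9][];
    for (int i = 0; i < 9; i++) {
      splitPoints[i] = Bytes.toBytes(String.format("user%04d", (i + 1) * 1000));
    }
    admin.createTable(desc, splitPoints);
  }
}
</programlisting>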
</section>
<section xml:id="importtsv.also"><title>See Also</title>
<para>For more information about bulk-loading HFiles into HBase, see <xref linkend="arch.bulk.load"/>.
</para>
</section>
</section>
<section xml:id="bulk.loading">
<title>Bulk Loading</title>
<para>For information about bulk-loading HFiles into HBase, see <link xlink:href="http://hbase.apache.org/bulk-loads.html">Bulk Loads</link>.
This page currently exists on the website and will eventually be migrated into the RefGuide.

<section xml:id="completebulkload">
<title>CompleteBulkLoad</title>
<para>The <code>completebulkload</code> utility will move generated StoreFiles into an HBase table. This utility is often used
in conjunction with output from <xref linkend="importtsv"/>.
</para>
<para>There are two ways to invoke this utility, with explicit classname and via the driver:
<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <hdfs://storefileoutput> <tablename>
</programlisting>
... and via the Driver:
<programlisting>HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar completebulkload <hdfs://storefileoutput> <tablename>
</programlisting>
</para>
<para>For more information about bulk-loading HFiles into HBase, see <xref linkend="arch.bulk.load"/>.
</para>
</section>
<section xml:id="walplayer">
@@ -23,149 +23,9 @@
</title>
</properties>
<body>
<section name="Overview">
<p>
HBase includes several methods of loading data into tables.
The most straightforward method is to either use the TableOutputFormat
class from a MapReduce job, or use the normal client APIs; however,
these are not always the most efficient methods.
</p>
<p>
This document describes HBase's bulk load functionality. The bulk load
feature uses a MapReduce job to output table data in HBase's internal
data format, and then directly loads the data files into a running
cluster. Using bulk load will use less CPU and network resources than
simply using the HBase API.
</p>
</section>
<section name="Bulk Load Architecture">
<p>
The HBase bulk load process consists of two main steps.
</p>
<section name="Preparing data via a MapReduce job">
<p>
The first step of a bulk load is to generate HBase data files from
a MapReduce job using HFileOutputFormat. This output format writes
out data in HBase's internal storage format so that they can be
later loaded very efficiently into the cluster.
</p>
<p>
In order to function efficiently, HFileOutputFormat must be
configured such that each output HFile fits within a single region.
In order to do this, jobs whose output will be bulk loaded into HBase
use Hadoop's TotalOrderPartitioner class to partition the map output
into disjoint ranges of the key space, corresponding to the key
ranges of the regions in the table.
</p>
<p>
HFileOutputFormat includes a convenience function,
<code>configureIncrementalLoad()</code>, which automatically sets up
a TotalOrderPartitioner based on the current region boundaries of a
table.
</p>
</section>
<section name="Completing the data load">
<p>
After the data has been prepared using
<code>HFileOutputFormat</code>, it is loaded into the cluster using
<code>completebulkload</code>. This command line tool iterates
through the prepared data files, and for each one determines the
region the file belongs to. It then contacts the appropriate Region
Server which adopts the HFile, moving it into its storage directory
and making the data available to clients.
</p>
<p>
If the region boundaries have changed during the course of bulk load
preparation, or between the preparation and completion steps, the
<code>completebulkloads</code> utility will automatically split the
data files into pieces corresponding to the new boundaries. This
process is not optimally efficient, so users should take care to
minimize the delay between preparing a bulk load and importing it
into the cluster, especially if other clients are simultaneously
loading data through other means.
</p>
</section>
</section>
<section name="Importing the prepared data using the completebulkload tool">
<p>
After a data import has been prepared, either by using the
<code>importtsv</code> tool with the
"<code>importtsv.bulk.output</code>" option or by some other MapReduce
job using the <code>HFileOutputFormat</code>, the
<code>completebulkload</code> tool is used to import the data into the
running cluster.
</p>
<p>
The <code>completebulkload</code> tool simply takes the output path
where <code>importtsv</code> or your MapReduce job put its results, and
the table name to import into. For example:
</p>
<code>$ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable</code>
<p>
The <code>-c config-file</code> option can be used to specify a file
containing the appropriate hbase parameters (e.g., hbase-site.xml) if
not supplied already on the CLASSPATH (In addition, the CLASSPATH must
contain the directory that has the zookeeper configuration file if
zookeeper is NOT managed by HBase).
</p>
<p>
<b>Note:</b> If the target table does not already exist in HBase, this
tool will create the table automatically.</p>
<p>
This tool will run quickly, after which point the new data will be visible in
the cluster.
</p>
</section>
<section name="Using the importtsv tool to bulk load data">
<p>
HBase ships with a command line tool called <code>importtsv</code>
which when given files containing data in TSV form can prepare this
data for bulk import into HBase. This tool by default uses the HBase
<code>put</code> API to insert data into HBase one row at a time, but
when the "<code>importtsv.bulk.output</code>" option is used,
<code>importtsv</code> will instead generate files using
<code>HFileOutputFormat</code> which can subsequently be bulk-loaded
into HBase using the <code>completebulkload</code> tool described
above. This tool is available by running "<code>hadoop jar
/path/to/hbase-VERSION.jar importtsv</code>". Running this command
with no arguments prints brief usage information:
</p>
<code><pre>
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>

Imports the given input directory of TSV data into the specified table.

The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family, or a columnfamily:qualifier. The special
column name HBASE_ROW_KEY is used to designate that this column should be used
as the row key for each imported record. You must specify exactly one column
to be the row key, and you must specify a column name for every column that exists in the
input data.

By default importtsv will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:
-Dimporttsv.bulk.output=/path/for/output
Note: if you do not use this option, then the target table must already exist in HBase

Other options that may be specified with -D include:
-Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
'-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
-Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
-Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
</pre></code>
</section>
<section name="Advanced Usage">
<p>
Although the <code>importtsv</code> tool is useful in many cases, advanced users may
want to generate data programatically, or import data from other formats. To get
started doing so, dig into <code>ImportTsv.java</code> and check the JavaDoc for
HFileOutputFormat.
</p>
<p>
The import step of the bulk load can also be done programatically. See the
<code>LoadIncrementalHFiles</code> class for more information.
</p>
</section>
<p>This page has been retired. The contents have been moved to the
<a href="http://hbase.apache.org/book.html#arch.bulk.load">Bulk Loading</a> section
in the Reference Guide.
</p>
</body>
</document>