HBASE-3240 Improve documentation of importtsv and bulk loads.

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1143232 13f79535-47bb-0310-9956-ffa450edef68
Todd Lipcon 2011-07-06 00:09:55 +00:00
parent 1c151dcfac
commit 8b2948bd62
3 changed files with 77 additions and 51 deletions

@@ -296,6 +296,8 @@ Release 0.91.0 - Unreleased
 (Mingjie Lai via garyh)
 HBASE-4036 Implementing a MultipleColumnPrefixFilter (Anirudh Todi)
 HBASE-4048 [Coprocessors] Support configuration of coprocessor at load time
+HBASE-3240 Improve documentation of importtsv and bulk loads.
+ (Aaron T. Myers via todd)
 
 TASKS
 HBASE-3559 Move report of split to master OFF the heartbeat channel

@@ -254,11 +254,13 @@ public class ImportTsv {
 "column name is either a simple column family, or a columnfamily:qualifier. The special\n" +
 "column name HBASE_ROW_KEY is used to designate that this column should be used\n" +
 "as the row key for each imported record. You must specify exactly one column\n" +
-"to be the row key.\n" +
+"to be the row key, and you must specify a column name for every column that exists in the\n" +
+"input data.\n" +
 "\n" +
 "By default importtsv will load data directly into HBase. To instead generate\n" +
 "HFiles of data to prepare for a bulk data load, pass the option:\n" +
 "  -D" + BULK_OUTPUT_CONF_KEY + "=/path/for/output\n" +
+"  Note: if you do not use this option, then the target table must already exist in HBase\n" +
 "\n" +
 "Other options that may be specified with -D include:\n" +
 "  -D" + SKIP_LINES_CONF_KEY + "=false - fail if encountering an invalid line\n" +

@@ -34,8 +34,8 @@
 This document describes HBase's bulk load functionality. The bulk load
 feature uses a MapReduce job to output table data in HBase's internal
 data format, and then directly loads the data files into a running
-cluster. Using bulk load will use less CPU and network than going
-via the HBase API.
+cluster. Using bulk load will use less CPU and network resources than
+simply using the HBase API.
 </p>
 </section>
 <section name="Bulk Load Architecture">
@@ -50,43 +50,85 @@
 later loaded very efficiently into the cluster.
 </p>
 <p>
-In order to function efficiently, HFileOutputFormat must be configured
-such that each output HFile fits within a single region. In order to
-do this, jobs use Hadoop's TotalOrderPartitioner class to partition the
-map output into disjoint ranges of the key space, corresponding to the
-key ranges of the regions in the table.
+In order to function efficiently, HFileOutputFormat must be
+configured such that each output HFile fits within a single region.
+In order to do this, jobs whose output will be bulk loaded into HBase
+use Hadoop's TotalOrderPartitioner class to partition the map output
+into disjoint ranges of the key space, corresponding to the key
+ranges of the regions in the table.
 </p>
 <p>
-HFileOutputFormat includes a convenience function, <code>configureIncrementalLoad()</code>,
-which automatically sets up a TotalOrderPartitioner based on the current
-region boundaries of a table.
+HFileOutputFormat includes a convenience function,
+<code>configureIncrementalLoad()</code>, which automatically sets up
+a TotalOrderPartitioner based on the current region boundaries of a
+table.
 </p>
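
To make the convenience function concrete, here is a minimal sketch of a preparation job wired up with configureIncrementalLoad(). The mapper, column layout, table name, and paths are hypothetical; the HTable-based signature shown is the one this generation of HFileOutputFormat exposes.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadPrepare {

  /** Hypothetical mapper: turns "rowkey<TAB>value" lines into Puts. */
  static class LineMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(parts[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("d"), Bytes.toBytes("c1"), Bytes.toBytes(parts[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable"); // region boundaries read from here
    Job job = new Job(conf, "prepare-hfiles");
    job.setJarByClass(BulkLoadPrepare.class);
    job.setMapperClass(LineMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path("/user/todd/input"));
    FileOutputFormat.setOutputPath(job, new Path("/user/todd/myoutput"));
    // Sets the output format, a TotalOrderPartitioner whose partition file
    // mirrors the table's current region boundaries, and a sorting reducer.
    HFileOutputFormat.configureIncrementalLoad(job, table);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}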
 </section>
 <section name="Completing the data load">
 <p>
-After the data has been prepared using <code>HFileOutputFormat</code>, it
-is loaded into the cluster using a command line tool. This command line tool
-iterates through the prepared data files, and for each one determines the
-region the file belongs to. It then contacts the appropriate Region Server
-which adopts the HFile, moving it into its storage directory and making
-the data available to clients.
+After the data has been prepared using
+<code>HFileOutputFormat</code>, it is loaded into the cluster using
+<code>completebulkload</code>. This command line tool iterates
+through the prepared data files, and for each one determines the
+region the file belongs to. It then contacts the appropriate Region
+Server which adopts the HFile, moving it into its storage directory
+and making the data available to clients.
 </p>
 <p>
 If the region boundaries have changed during the course of bulk load
-preparation, or between the preparation and completion steps, the bulk
-load commandline utility will automatically split the data files into
-pieces corresponding to the new boundaries. This process is not
-optimally efficient, so users should take care to minimize the delay between
-preparing a bulk load and importing it into the cluster, especially
-if other clients are simultaneously loading data through other means.
+preparation, or between the preparation and completion steps, the
+<code>completebulkload</code> utility will automatically split the
+data files into pieces corresponding to the new boundaries. This
+process is not optimally efficient, so users should take care to
+minimize the delay between preparing a bulk load and importing it
+into the cluster, especially if other clients are simultaneously
+loading data through other means.
 </p>
 </section>
 </section>
-<section name="Preparing a bulk load using the importtsv tool">
+<section name="Importing the prepared data using the completebulkload tool">
 <p>
-HBase ships with a command line tool called <code>importtsv</code>. This tool
-is available by running <code>hadoop jar /path/to/hbase-VERSION.jar importtsv</code>.
-Running this tool with no arguments prints brief usage information:
+After a data import has been prepared, either by using the
+<code>importtsv</code> tool with the
+"<code>importtsv.bulk.output</code>" option or by some other MapReduce
+job using the <code>HFileOutputFormat</code>, the
+<code>completebulkload</code> tool is used to import the data into the
+running cluster.
+</p>
+<p>
+The <code>completebulkload</code> tool simply takes the output path
+where <code>importtsv</code> or your MapReduce job put its results, and
+the table name to import into. For example:
+</p>
+<code>$ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable</code>
+<p>
+The <code>-c config-file</code> option can be used to specify a file
+containing the appropriate hbase parameters (e.g., hbase-site.xml) if
+not supplied already on the CLASSPATH (In addition, the CLASSPATH must
+contain the directory that has the zookeeper configuration file if
+zookeeper is NOT managed by HBase).
+</p>
+<p>
+<b>Note:</b> If the target table does not already exist in HBase, this
+tool will create the table automatically.</p>
+<p>
+This tool will run quickly, after which point the new data will be visible in
+the cluster.
+</p>
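
For readers who prefer the API, a hedged sketch of the same completion step via LoadIncrementalHFiles, the class behind <code>completebulkload</code> (table name and output path are the same illustrative values used above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class CompleteBulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");
    // Walks the prepared output directory and hands each HFile to the
    // region server owning its key range, splitting files if needed.
    new LoadIncrementalHFiles(conf).doBulkLoad(
        new Path("/user/todd/myoutput"), table);
  }
}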
+</section>
+<section name="Using the importtsv tool to bulk load data">
+<p>
+HBase ships with a command line tool called <code>importtsv</code>
+which when given files containing data in TSV form can prepare this
+data for bulk import into HBase. This tool by default uses the HBase
+<code>put</code> API to insert data into HBase one row at a time, but
+when the "<code>importtsv.bulk.output</code>" option is used,
+<code>importtsv</code> will instead generate files using
+<code>HFileOutputFormat</code> which can subsequently be bulk-loaded
+into HBase using the <code>completebulkload</code> tool described
+above. This tool is available by running "<code>hadoop jar
+/path/to/hbase-VERSION.jar importtsv</code>". Running this command
+with no arguments prints brief usage information:
 </p>
 <code><pre>
 Usage: importtsv -Dimporttsv.columns=a,b,c &lt;tablename&gt; &lt;inputdir&gt;
@@ -98,41 +140,21 @@ option. This option takes the form of comma-separated column names, where each
 column name is either a simple column family, or a columnfamily:qualifier. The special
 column name HBASE_ROW_KEY is used to designate that this column should be used
 as the row key for each imported record. You must specify exactly one column
-to be the row key.
+to be the row key, and you must specify a column name for every column that exists in the
+input data.
 
 By default importtsv will load data directly into HBase. To instead generate
 HFiles of data to prepare for a bulk data load, pass the option:
   -Dimporttsv.bulk.output=/path/for/output
+  Note: if you do not use this option, then the target table must already exist in HBase
 
 Other options that may be specified with -D include:
   -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
-  -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
   '-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
-  -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of TsvImporterMapper
+  -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
+  -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
 </pre></code>
 </section>
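
For instance, assuming a hypothetical table mytable with column family d, TSV input could be prepared and then loaded with a pair of commands in the style shown earlier:

<code>$ hadoop jar /path/to/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 -Dimporttsv.bulk.output=/user/todd/myoutput mytable /user/todd/input</code>

<code>$ hadoop jar /path/to/hbase-VERSION.jar completebulkload /user/todd/myoutput mytable</code>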
-<section name="Importing the prepared data using the completebulkload tool">
-<p>
-After a data import has been prepared using the <code>importtsv</code> tool, the
-<code>completebulkload</code> tool is used to import the data into the running cluster.
-</p>
-<p>
-The <code>completebulkload</code> tool simply takes the same output path where
-<code>importtsv</code> put its results, and the table name. For example:
-</p>
-<code>$ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable</code>
-<p>The <code>-c config-file</code> option can be used to specify a file containing the
-appropriate hbase parameters (e.g., hbase-site.xml) if not supplied already on
-the CLASSPATH (In addition, the CLASSPATH must contain the directory that has
-the zookeeper configuration file if zookeeper is NOT managed by HBase).
-</p>
-<p>
-This tool will run quickly, after which point the new data will be visible in
-the cluster.
-</p>
-</section>
 <section name="Advanced Usage">
 <p>
 Although the <code>importtsv</code> tool is useful in many cases, advanced users may