HBASE-3240 Improve documentation of importtsv and bulk loads.
git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1143232 13f79535-47bb-0310-9956-ffa450edef68
parent 1c151dcfac
commit 8b2948bd62
@@ -296,6 +296,8 @@ Release 0.91.0 - Unreleased
     (Mingjie Lai via garyh)
  HBASE-4036 Implementing a MultipleColumnPrefixFilter (Anirudh Todi)
  HBASE-4048 [Coprocessors] Support configuration of coprocessor at load time
+ HBASE-3240 Improve documentation of importtsv and bulk loads.
+    (Aaron T. Myers via todd)

  TASKS
  HBASE-3559 Move report of split to master OFF the heartbeat channel
@@ -254,11 +254,13 @@ public class ImportTsv {
  "column name is either a simple column family, or a columnfamily:qualifier. The special\n" +
  "column name HBASE_ROW_KEY is used to designate that this column should be used\n" +
  "as the row key for each imported record. You must specify exactly one column\n" +
- "to be the row key.\n" +
+ "to be the row key, and you must specify a column name for every column that exists in the\n" +
+ "input data.\n" +
  "\n" +
  "By default importtsv will load data directly into HBase. To instead generate\n" +
  "HFiles of data to prepare for a bulk data load, pass the option:\n" +
  " -D" + BULK_OUTPUT_CONF_KEY + "=/path/for/output\n" +
+ " Note: if you do not use this option, then the target table must already exist in HBase\n" +
  "\n" +
  "Other options that may be specified with -D include:\n" +
  " -D" + SKIP_LINES_CONF_KEY + "=false - fail if encountering an invalid line\n" +
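As a side note for readers of this hunk (not part of the commit): the two constants above correspond to the literal keys "importtsv.bulk.output" and "importtsv.skip.bad.lines" shown in the documentation hunk further down, and the same options can be set on a job Configuration rather than passed as -D flags. A minimal sketch, with a hypothetical column spec, family, and output path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ImportTsvOptionsSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Equivalent of -Dimporttsv.columns=...: the first TSV field is the row
    // key, the remaining fields go to the (hypothetical) family "d".
    conf.set("importtsv.columns", "HBASE_ROW_KEY,d:c1,d:c2");
    // Equivalent of -Dimporttsv.bulk.output=...: write HFiles for a later
    // bulk load instead of inserting rows through the put API.
    conf.set("importtsv.bulk.output", "/path/for/output");
    // Equivalent of -Dimporttsv.skip.bad.lines=false: fail on malformed input.
    conf.setBoolean("importtsv.skip.bad.lines", false);
  }
}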
@@ -34,8 +34,8 @@
  This document describes HBase's bulk load functionality. The bulk load
  feature uses a MapReduce job to output table data in HBase's internal
  data format, and then directly loads the data files into a running
- cluster. Using bulk load will use less CPU and network than going
- via the HBase API.
+ cluster. Using bulk load will use less CPU and network resources than
+ simply using the HBase API.
  </p>
  </section>
  <section name="Bulk Load Architecture">
@@ -50,43 +50,85 @@
  later loaded very efficiently into the cluster.
  </p>
  <p>
- In order to function efficiently, HFileOutputFormat must be configured
- such that each output HFile fits within a single region. In order to
- do this, jobs use Hadoop's TotalOrderPartitioner class to partition the
- map output into disjoint ranges of the key space, corresponding to the
- key ranges of the regions in the table.
+ In order to function efficiently, HFileOutputFormat must be
+ configured such that each output HFile fits within a single region.
+ In order to do this, jobs whose output will be bulk loaded into HBase
+ use Hadoop's TotalOrderPartitioner class to partition the map output
+ into disjoint ranges of the key space, corresponding to the key
+ ranges of the regions in the table.
  </p>
  <p>
- HFileOutputFormat includes a convenience function, <code>configureIncrementalLoad()</code>,
- which automatically sets up a TotalOrderPartitioner based on the current
- region boundaries of a table.
+ HFileOutputFormat includes a convenience function,
+ <code>configureIncrementalLoad()</code>, which automatically sets up
+ a TotalOrderPartitioner based on the current region boundaries of a
+ table.
  </p>
  </section>
  <section name="Completing the data load">
  <p>
- After the data has been prepared using <code>HFileOutputFormat</code>, it
- is loaded into the cluster using a command line tool. This command line tool
- iterates through the prepared data files, and for each one determines the
- region the file belongs to. It then contacts the appropriate Region Server
- which adopts the HFile, moving it into its storage directory and making
- the data available to clients.
+ After the data has been prepared using
+ <code>HFileOutputFormat</code>, it is loaded into the cluster using
+ <code>completebulkload</code>. This command line tool iterates
+ through the prepared data files, and for each one determines the
+ region the file belongs to. It then contacts the appropriate Region
+ Server which adopts the HFile, moving it into its storage directory
+ and making the data available to clients.
  </p>
  <p>
  If the region boundaries have changed during the course of bulk load
- preparation, or between the preparation and completion steps, the bulk
- load commandline utility will automatically split the data files into
- pieces corresponding to the new boundaries. This process is not
- optimally efficient, so users should take care to minimize the delay between
- preparing a bulk load and importing it into the cluster, especially
- if other clients are simultaneously loading data through other means.
+ preparation, or between the preparation and completion steps, the
+ <code>completebulkload</code> utility will automatically split the
+ data files into pieces corresponding to the new boundaries. This
+ process is not optimally efficient, so users should take care to
+ minimize the delay between preparing a bulk load and importing it
+ into the cluster, especially if other clients are simultaneously
+ loading data through other means.
  </p>
  </section>
  </section>
- <section name="Preparing a bulk load using the importtsv tool">
+ <section name="Importing the prepared data using the completebulkload tool">
  <p>
- HBase ships with a command line tool called <code>importtsv</code>. This tool
- is available by running <code>hadoop jar /path/to/hbase-VERSION.jar importtsv</code>.
- Running this tool with no arguments prints brief usage information:
+ After a data import has been prepared, either by using the
+ <code>importtsv</code> tool with the
+ "<code>importtsv.bulk.output</code>" option or by some other MapReduce
+ job using the <code>HFileOutputFormat</code>, the
+ <code>completebulkload</code> tool is used to import the data into the
+ running cluster.
+ </p>
+ <p>
+ The <code>completebulkload</code> tool simply takes the output path
+ where <code>importtsv</code> or your MapReduce job put its results, and
+ the table name to import into. For example:
+ </p>
+ <code>$ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable</code>
+ <p>
+ The <code>-c config-file</code> option can be used to specify a file
+ containing the appropriate HBase parameters (e.g., hbase-site.xml) if
+ not supplied already on the CLASSPATH (in addition, the CLASSPATH must
+ contain the directory that has the ZooKeeper configuration file if
+ ZooKeeper is NOT managed by HBase).
+ </p>
+ <p>
+ <b>Note:</b> If the target table does not already exist in HBase, this
+ tool will create the table automatically.</p>
+ <p>
+ This tool will run quickly, after which point the new data will be visible in
+ the cluster.
+ </p>
+ </section>
+ <section name="Using the importtsv tool to bulk load data">
+ <p>
+ HBase ships with a command line tool called <code>importtsv</code>
+ which, when given files containing data in TSV form, can prepare this
+ data for bulk import into HBase. This tool by default uses the HBase
+ <code>put</code> API to insert data into HBase one row at a time, but
+ when the "<code>importtsv.bulk.output</code>" option is used,
+ <code>importtsv</code> will instead generate files using
+ <code>HFileOutputFormat</code> which can subsequently be bulk-loaded
+ into HBase using the <code>completebulkload</code> tool described
+ above. This tool is available by running "<code>hadoop jar
+ /path/to/hbase-VERSION.jar importtsv</code>". Running this command
+ with no arguments prints brief usage information:
  </p>
  <code><pre>
  Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
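To make the prepare-then-complete flow described in this hunk concrete, here is a minimal sketch (illustrative only, not part of this commit) of a MapReduce driver that writes HFiles via HFileOutputFormat.configureIncrementalLoad() and then loads them with LoadIncrementalHFiles, the class that backs the completebulkload tool. The mapper, table name, column family, and paths are hypothetical placeholders.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSketch {

  /** Hypothetical mapper: first tab-separated field is the row key, second the value. */
  static class TsvToKeyValueMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      byte[] row = Bytes.toBytes(fields[0]);
      // Write the second field to the hypothetical column d:c1.
      KeyValue kv = new KeyValue(row, Bytes.toBytes("d"), Bytes.toBytes("c1"),
          Bytes.toBytes(fields[1]));
      context.write(new ImmutableBytesWritable(row), kv);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");

    Job job = new Job(conf, "prepare bulk load");
    job.setJarByClass(BulkLoadSketch.class);
    job.setMapperClass(TsvToKeyValueMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);
    FileInputFormat.addInputPath(job, new Path("/user/todd/input"));
    Path output = new Path("/user/todd/myoutput");
    FileOutputFormat.setOutputPath(job, output);

    // Sets the output format, a KeyValue-sorting reducer, and a
    // TotalOrderPartitioner seeded with the table's current region
    // boundaries, so that each HFile falls within a single region.
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (job.waitForCompletion(true)) {
      // Programmatic equivalent of:
      //   hadoop jar hbase-VERSION.jar completebulkload /user/todd/myoutput mytable
      new LoadIncrementalHFiles(conf).doBulkLoad(output, table);
    }
  }
}

Alternatively, the output directory can simply be handed to the completebulkload command shown in the documentation above.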
@@ -98,41 +140,21 @@ option. This option takes the form of comma-separated column names, where each
  column name is either a simple column family, or a columnfamily:qualifier. The special
  column name HBASE_ROW_KEY is used to designate that this column should be used
  as the row key for each imported record. You must specify exactly one column
- to be the row key.
+ to be the row key, and you must specify a column name for every column that exists in the
+ input data.

  By default importtsv will load data directly into HBase. To instead generate
  HFiles of data to prepare for a bulk data load, pass the option:
  -Dimporttsv.bulk.output=/path/for/output
  Note: if you do not use this option, then the target table must already exist in HBase

  Other options that may be specified with -D include:
  -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
- -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
  '-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
+ -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
- -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of TsvImporterMapper
-
+ -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
  </pre></code>
  </section>
- <section name="Importing the prepared data using the completebulkload tool">
- <p>
- After a data import has been prepared using the <code>importtsv</code> tool, the
- <code>completebulkload</code> tool is used to import the data into the running cluster.
- </p>
- <p>
- The <code>completebulkload</code> tool simply takes the same output path where
- <code>importtsv</code> put its results, and the table name. For example:
- </p>
- <code>$ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable</code>
- <p>The <code>-c config-file</code> option can be used to specify a file containing the
- appropriate hbase parameters (e.g., hbase-site.xml) if not supplied already on
- the CLASSPATH (In addition, the CLASSPATH must contain the directory that has
- the zookeeper configuration file if zookeeper is NOT managed by HBase).
- </p>
- <p>
- This tool will run quickly, after which point the new data will be visible in
- the cluster.
- </p>
- </section>
  <section name="Advanced Usage">
  <p>
  Although the <code>importtsv</code> tool is useful in many cases, advanced users may
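Since the updated usage text documents the importtsv.mapper.class option, a brief sketch of a replacement mapper may be useful. This is illustrative only and assumes such a mapper must produce the same output types as the default TsvImporterMapper, namely an ImmutableBytesWritable row key and a Put; the CSV parsing, family, and qualifier below are hypothetical.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Hypothetical replacement for TsvImporterMapper, selectable with
 *   -Dimporttsv.mapper.class=MyMapper
 * Here the input is "key,value" CSV rather than TSV, and the value is
 * upper-cased before being written to the hypothetical column d:c1.
 */
public class MyMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split(",", 2);
    byte[] row = Bytes.toBytes(fields[0]);
    Put put = new Put(row);
    put.add(Bytes.toBytes("d"), Bytes.toBytes("c1"),
        Bytes.toBytes(fields[1].toUpperCase()));
    context.write(new ImmutableBytesWritable(row), put);
  }
}

With the bulk output option set, Puts emitted by such a mapper go through the same HFileOutputFormat pipeline described earlier.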