hbase-6078. ported and refactored bulk-loading information into RefGuide

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1342073 13f79535-47bb-0310-9956-ffa450edef68
Doug Meil 2012-05-23 22:09:37 +00:00
parent be6e4c6593
commit 1f242db7f7
3 changed files with 217 additions and 150 deletions


@ -2372,7 +2372,118 @@ myHtd.setValue(HTableDescriptor.SPLIT_POLICY, MyCustomSplitPolicy.class.getName(
</section>
</section> <!-- bloom -->
</section> <!-- regions -->
<section xml:id="arch.bulk.load"><title>Bulk Loading</title>
<section xml:id="arch.bulk.load.overview"><title>Overview</title>
<para>
HBase includes several methods of loading data into tables.
The most straightforward method is to either use the <code>TableOutputFormat</code>
class from a MapReduce job, or use the normal client APIs; however,
these are not always the most efficient methods.
</para>
<para>
The bulk load feature uses a MapReduce job to output table data in HBase's internal
data format, and then directly loads the generated StoreFiles into a running
cluster. Using bulk load will use less CPU and network resources than
simply using the HBase API.
</para>
</section>
<section xml:id="arch.bulk.load.arch"><title>Bulk Load Architecture</title>
<para>
The HBase bulk load process consists of two main steps.
</para>
<section xml:id="arch.bulk.load.prep"><title>Preparing data via a MapReduce job</title>
<para>
The first step of a bulk load is to generate HBase data files (StoreFiles) from
a MapReduce job using <code>HFileOutputFormat</code>. This output format writes
out data in HBase's internal storage format so that it can later be
loaded very efficiently into the cluster.
</para>
<para>
In order to function efficiently, <code>HFileOutputFormat</code> must be
configured such that each output HFile fits within a single region.
In order to do this, jobs whose output will be bulk loaded into HBase
use Hadoop's <code>TotalOrderPartitioner</code> class to partition the map output
into disjoint ranges of the key space, corresponding to the key
ranges of the regions in the table.
</para>
<para>
<code>HFileOutputFormat</code> includes a convenience function,
<code>configureIncrementalLoad()</code>, which automatically sets up
a <code>TotalOrderPartitioner</code> based on the current region boundaries of a
table.
</para>
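<para>
As a rough illustration of the partitioning idea described above (a conceptual sketch, not
the actual <code>TotalOrderPartitioner</code> or HBase code), the following Python snippet assigns
row keys to regions by binary search over the sorted region start keys. The function name
<code>region_index</code>, the region boundaries, and the sample keys are all hypothetical.
</para>

```python
from bisect import bisect_right


def region_index(row_key, region_start_keys):
    """Return the index of the region whose key range contains row_key.

    region_start_keys must be sorted and begin with b"" (the first region
    has an empty start key); region i covers [start[i], start[i + 1]).
    """
    return bisect_right(region_start_keys, row_key) - 1


# Hypothetical table with three regions: [b"", b"g"), [b"g", b"p"), [b"p", end)
starts = [b"", b"g", b"p"]

# Route each key to the partition (one per region) that owns it.
partitions = {}
for key in [b"apple", b"grape", b"pear", b"fig", b"zebra", b"g"]:
    partitions.setdefault(region_index(key, starts), []).append(key)

# Keys within one partition are sorted before being written out,
# so each output file covers a key range inside a single region.
for idx in partitions:
    partitions[idx].sort()
```

<para>
Because every partition corresponds to exactly one region's key range, each output file
produced from a partition fits within a single region, which is the property the text above requires.
</para>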
</section>
<section xml:id="arch.bulk.load.complete"><title>Completing the data load</title>
<para>
After the data has been prepared using
<code>HFileOutputFormat</code>, it is loaded into the cluster using
<code>completebulkload</code>. This command line tool iterates
through the prepared data files, and for each one determines the
region the file belongs to. It then contacts the appropriate Region
Server which adopts the HFile, moving it into its storage directory
and making the data available to clients.
</para>
<para>
If the region boundaries have changed during the course of bulk load
preparation, or between the preparation and completion steps, the
<code>completebulkload</code> utility will automatically split the
data files into pieces corresponding to the new boundaries. This
process is not optimally efficient, so users should take care to
minimize the delay between preparing a bulk load and importing it
into the cluster, especially if other clients are simultaneously
loading data through other means.
</para>
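<para>
The effect of a changed region boundary can be sketched as follows. This is a simplified
Python model, not the logic of the real tool: a prepared file is represented as a sorted list of
row keys, and the hypothetical helper <code>split_for_regions</code> groups those keys by the
region that currently owns them. The keys and boundaries are made up for illustration.
</para>

```python
from bisect import bisect_right


def split_for_regions(sorted_keys, region_start_keys):
    """Group an already-sorted run of keys by the region that now owns each key.

    Returns one list per region touched, in region order. A single list
    means the file still fits in one region and needs no splitting.
    """
    pieces = {}
    for key in sorted_keys:
        idx = bisect_right(region_start_keys, key) - 1
        pieces.setdefault(idx, []).append(key)
    return [pieces[i] for i in sorted(pieces)]


# A file prepared for the region starting at b"m".
file_keys = [b"n", b"q", b"t", b"x"]

# Boundaries at preparation time: the file fits in one region.
unchanged = split_for_regions(file_keys, [b"", b"m"])

# The region split at b"s" before the load completed: two pieces are needed.
after_split = split_for_regions(file_keys, [b"", b"m", b"s"])
```

<para>
In the second case the single prepared file has to be rewritten as two pieces, which is the
extra work the paragraph above recommends avoiding by loading soon after preparation.
</para>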
</section>
</section>
<section xml:id="arch.bulk.load.import"><title>Importing the prepared data using the completebulkload tool</title>
<para>
After a data import has been prepared, either by using the
<code>importtsv</code> tool with the
"<code>importtsv.bulk.output</code>" option or by some other MapReduce
job using the <code>HFileOutputFormat</code>, the
<code>completebulkload</code> tool is used to import the data into the
running cluster.
</para>
<para>
The <code>completebulkload</code> tool simply takes the output path
where <code>importtsv</code> or your MapReduce job put its results, and
the table name to import into. For example:
</para>
<programlisting>$ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable</programlisting>
<para>
The <code>-c config-file</code> option can be used to specify a file
containing the appropriate HBase parameters (e.g., hbase-site.xml) if
it is not already supplied on the CLASSPATH. In addition, the CLASSPATH must
contain the directory that has the ZooKeeper configuration file if
ZooKeeper is not managed by HBase.
</para>
<para>
Note: If the target table does not already exist in HBase, this
tool will create the table automatically.</para>
<para>
This tool will run quickly, after which point the new data will be visible in
the cluster.
</para>
</section>
<section xml:id="arch.bulk.load.also"><title>See Also</title>
<para>For more information about the referenced utilities, see <xref linkend="importtsv"/> and <xref linkend="completebulkload"/>.
</para>
</section>
<section xml:id="arch.bulk.load.adv"><title>Advanced Usage</title>
<para>
Although the <code>importtsv</code> tool is useful in many cases, advanced users may
want to generate data programmatically, or import data from other formats. To get
started doing so, dig into <code>ImportTsv.java</code> and check the JavaDoc for
<code>HFileOutputFormat</code>.
</para>
<para>
The import step of the bulk load can also be done programmatically. See the
<code>LoadIncrementalHFiles</code> class for more information.
</para>
</section>
</section> <!-- bulk loading -->
<section xml:id="arch.hdfs"><title>HDFS</title>
<para>As HBase runs on HDFS (and each StoreFile is written as a file on HDFS),


@ -36,6 +36,25 @@
<para>Here we list HBase tools for administration, analysis, fixup, and
debugging.</para>
<section xml:id="driver"><title>Driver</title>
<para>The HBase jar executes a <code>Driver</code> class that can be used to invoke frequently used utilities. For example,
<programlisting>HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar
</programlisting>
... will return...
<programlisting>
An example program must be given as the first argument.
Valid program names are:
completebulkload: Complete a bulk data load.
copytable: Export a table from local cluster to peer cluster
export: Write table data to HDFS.
import: Import data written by Export.
importtsv: Import data in TSV format.
rowcounter: Count rows in HBase table
verifyrep: Compare the data from tables in two different clusters. WARNING: It doesn't work for incrementColumnValues'd cells since the timestamp is chan
</programlisting>
... for allowable program names.
</para>
</section>
<section xml:id="hbck">
<title>HBase <application>hbck</application></title>
<subtitle>An <emphasis>fsck</emphasis> for your HBase install</subtitle>
@ -133,15 +152,92 @@
</section>
<section xml:id="importtsv">
<title>ImportTsv</title>
<para>Import is a utility that will load data in TSV format into HBase. Invoke via:
<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c &lt;tablename&gt; &lt;inputdir&gt;
<para>ImportTsv is a utility that will load data in TSV format into HBase. It has two distinct usages: loading data from TSV format in HDFS
into HBase via Puts, and preparing StoreFiles to be loaded via the <code>completebulkload</code> utility.
</para>
<para>To load data via Puts (i.e., non-bulk loading):
<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c &lt;tablename&gt; &lt;hdfs-inputdir&gt;
</programlisting>
</para>
<para>To generate StoreFiles for bulk-loading:
<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c -Dimporttsv.bulk.output=hdfs://storefile-outputdir &lt;tablename&gt; &lt;hdfs-data-inputdir&gt;
</programlisting>
</para>
<para>These generated StoreFiles can be loaded into HBase via <xref linkend="completebulkload"/>.
</para>
<section xml:id="importtsv.options"><title>ImportTsv Options</title>
<para>Running ImportTsv with no arguments prints brief usage information:</para>
<programlisting>
Usage: importtsv -Dimporttsv.columns=a,b,c &lt;tablename&gt; &lt;inputdir&gt;
Imports the given input directory of TSV data into the specified table.
The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family, or a columnfamily:qualifier. The special
column name HBASE_ROW_KEY is used to designate that this column should be used
as the row key for each imported record. You must specify exactly one column
to be the row key, and you must specify a column name for every column that exists in the
input data.
By default importtsv will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:
-Dimporttsv.bulk.output=/path/for/output
Note: if you do not use this option, then the target table must already exist in HBase
Other options that may be specified with -D include:
-Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
'-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
-Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
-Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
</programlisting>
</section>
<section xml:id="bulk.loading">
<title>Bulk Loading</title>
<para>For information about bulk-loading HFiles into HBase, see <link xlink:href="http://hbase.apache.org/bulk-loads.html">Bulk Loads</link>.
This page currently exists on the website and will eventually be migrated into the RefGuide.
<section xml:id="importtsv.example"><title>ImportTsv Example</title>
<para>For example, assume that we are loading data into a table called 'datatsv' with a ColumnFamily called 'd' with two columns "c1" and "c2".
</para>
<para>Assume that an input file exists as follows:
<programlisting>
row1 c1 c2
row2 c1 c2
row3 c1 c2
row4 c1 c2
row5 c1 c2
row6 c1 c2
row7 c1 c2
row8 c1 c2
row9 c1 c2
row10 c1 c2
</programlisting>
</para>
<para>For ImportTsv to use this input file, the command line needs to look like this:
<programlisting>
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile
</programlisting>
... and in this example the first column is the rowkey, which is why the HBASE_ROW_KEY is used. The second and third columns in the file will be imported as "d:c1" and "d:c2", respectively.
</para>
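<para>
To make the column mapping concrete, here is a small Python sketch of how one line of the
input file above lines up with the <code>-Dimporttsv.columns</code> specification. It only mirrors
the rules stated in the usage text (exactly one <code>HBASE_ROW_KEY</code>, one column name per
field) and is not the <code>TsvImporterMapper</code> implementation; the function name
<code>parse_tsv_line</code> and the choice to raise an error on a bad line are assumptions of this sketch.
</para>

```python
def parse_tsv_line(line, columns_spec, separator="\t"):
    """Map one TSV line to (row_key, {column_name: value}).

    columns_spec uses the same comma-separated form as -Dimporttsv.columns,
    for example "HBASE_ROW_KEY,d:c1,d:c2".
    """
    columns = columns_spec.split(",")
    fields = line.split(separator)
    if columns.count("HBASE_ROW_KEY") != 1:
        raise ValueError("exactly one column must be HBASE_ROW_KEY")
    if len(fields) != len(columns):
        raise ValueError("line does not have one field per declared column")
    row_key = fields[columns.index("HBASE_ROW_KEY")]
    # Every remaining field becomes a cell under its declared column name.
    cells = {
        name: value
        for name, value in zip(columns, fields)
        if name != "HBASE_ROW_KEY"
    }
    return row_key, cells


# First line of the example input file, with tab-separated fields.
row_key, cells = parse_tsv_line("row1\tc1\tc2", "HBASE_ROW_KEY,d:c1,d:c2")
```

<para>
Here the first field becomes the row key and the remaining two fields become the values
stored under "d:c1" and "d:c2".
</para>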
</section>
<section xml:id="importtsv.warning"><title>ImportTsv Warning</title>
<para>If you are preparing a lot of data for bulk loading, make sure the target HBase table is pre-split appropriately.
</para>
</section>
<section xml:id="importtsv.also"><title>See Also</title>
<para>For more information about bulk-loading HFiles into HBase, see <xref linkend="arch.bulk.load"/>.</para>
</section>
</section>
<section xml:id="completebulkload">
<title>CompleteBulkLoad</title>
<para>The <code>completebulkload</code> utility will move generated StoreFiles into an HBase table. This utility is often used
in conjunction with output from <xref linkend="importtsv"/>.
</para>
<para>There are two ways to invoke this utility: with an explicit classname, and via the driver:
<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles &lt;hdfs://storefileoutput&gt; &lt;tablename&gt;
</programlisting>
... and via the Driver...
<programlisting>HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar completebulkload &lt;hdfs://storefileoutput&gt; &lt;tablename&gt;
</programlisting>
</para>
<para>For more information about bulk-loading HFiles into HBase, see <xref linkend="arch.bulk.load"/>.
</para>
</section>
<section xml:id="walplayer">


@ -23,149 +23,9 @@
</title>
</properties>
<body>
<section name="Overview">
<p>
HBase includes several methods of loading data into tables.
The most straightforward method is to either use the TableOutputFormat
class from a MapReduce job, or use the normal client APIs; however,
these are not always the most efficient methods.
<p>This page has been retired. The contents have been moved to the
<a href="http://hbase.apache.org/book.html#arch.bulk.load">Bulk Loading</a> section
in the Reference Guide.
</p>
<p>
This document describes HBase's bulk load functionality. The bulk load
feature uses a MapReduce job to output table data in HBase's internal
data format, and then directly loads the data files into a running
cluster. Using bulk load will use less CPU and network resources than
simply using the HBase API.
</p>
</section>
<section name="Bulk Load Architecture">
<p>
The HBase bulk load process consists of two main steps.
</p>
<section name="Preparing data via a MapReduce job">
<p>
The first step of a bulk load is to generate HBase data files from
a MapReduce job using HFileOutputFormat. This output format writes
out data in HBase's internal storage format so that they can be
later loaded very efficiently into the cluster.
</p>
<p>
In order to function efficiently, HFileOutputFormat must be
configured such that each output HFile fits within a single region.
In order to do this, jobs whose output will be bulk loaded into HBase
use Hadoop's TotalOrderPartitioner class to partition the map output
into disjoint ranges of the key space, corresponding to the key
ranges of the regions in the table.
</p>
<p>
HFileOutputFormat includes a convenience function,
<code>configureIncrementalLoad()</code>, which automatically sets up
a TotalOrderPartitioner based on the current region boundaries of a
table.
</p>
</section>
<section name="Completing the data load">
<p>
After the data has been prepared using
<code>HFileOutputFormat</code>, it is loaded into the cluster using
<code>completebulkload</code>. This command line tool iterates
through the prepared data files, and for each one determines the
region the file belongs to. It then contacts the appropriate Region
Server which adopts the HFile, moving it into its storage directory
and making the data available to clients.
</p>
<p>
If the region boundaries have changed during the course of bulk load
preparation, or between the preparation and completion steps, the
<code>completebulkload</code> utility will automatically split the
data files into pieces corresponding to the new boundaries. This
process is not optimally efficient, so users should take care to
minimize the delay between preparing a bulk load and importing it
into the cluster, especially if other clients are simultaneously
loading data through other means.
</p>
</section>
</section>
<section name="Importing the prepared data using the completebulkload tool">
<p>
After a data import has been prepared, either by using the
<code>importtsv</code> tool with the
"<code>importtsv.bulk.output</code>" option or by some other MapReduce
job using the <code>HFileOutputFormat</code>, the
<code>completebulkload</code> tool is used to import the data into the
running cluster.
</p>
<p>
The <code>completebulkload</code> tool simply takes the output path
where <code>importtsv</code> or your MapReduce job put its results, and
the table name to import into. For example:
</p>
<code>$ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable</code>
<p>
The <code>-c config-file</code> option can be used to specify a file
containing the appropriate hbase parameters (e.g., hbase-site.xml) if
not supplied already on the CLASSPATH (In addition, the CLASSPATH must
contain the directory that has the zookeeper configuration file if
zookeeper is NOT managed by HBase).
</p>
<p>
<b>Note:</b> If the target table does not already exist in HBase, this
tool will create the table automatically.</p>
<p>
This tool will run quickly, after which point the new data will be visible in
the cluster.
</p>
</section>
<section name="Using the importtsv tool to bulk load data">
<p>
HBase ships with a command line tool called <code>importtsv</code>
which when given files containing data in TSV form can prepare this
data for bulk import into HBase. This tool by default uses the HBase
<code>put</code> API to insert data into HBase one row at a time, but
when the "<code>importtsv.bulk.output</code>" option is used,
<code>importtsv</code> will instead generate files using
<code>HFileOutputFormat</code> which can subsequently be bulk-loaded
into HBase using the <code>completebulkload</code> tool described
above. This tool is available by running "<code>hadoop jar
/path/to/hbase-VERSION.jar importtsv</code>". Running this command
with no arguments prints brief usage information:
</p>
<code><pre>
Usage: importtsv -Dimporttsv.columns=a,b,c &lt;tablename&gt; &lt;inputdir&gt;
Imports the given input directory of TSV data into the specified table.
The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family, or a columnfamily:qualifier. The special
column name HBASE_ROW_KEY is used to designate that this column should be used
as the row key for each imported record. You must specify exactly one column
to be the row key, and you must specify a column name for every column that exists in the
input data.
By default importtsv will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:
-Dimporttsv.bulk.output=/path/for/output
Note: if you do not use this option, then the target table must already exist in HBase
Other options that may be specified with -D include:
-Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
'-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
-Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
-Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
</pre></code>
</section>
<section name="Advanced Usage">
<p>
Although the <code>importtsv</code> tool is useful in many cases, advanced users may
want to generate data programmatically, or import data from other formats. To get
started doing so, dig into <code>ImportTsv.java</code> and check the JavaDoc for
HFileOutputFormat.
</p>
<p>
The import step of the bulk load can also be done programmatically. See the
<code>LoadIncrementalHFiles</code> class for more information.
</p>
</section>
</body>
</document>