hbase-6078. ported and refactored bulk-loading information into RefGuide
git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1342073 13f79535-47bb-0310-9956-ffa450edef68
parent
be6e4c6593
commit
1f242db7f7
|
@ -2372,7 +2372,118 @@ myHtd.setValue(HTableDescriptor.SPLIT_POLICY, MyCustomSplitPolicy.class.getName(
|
|||
</section>
|
||||
</section> <!-- bloom -->
|
||||
|
||||
</section> <!-- regions -->
|
||||
|
||||
<section xml:id="arch.bulk.load"><title>Bulk Loading</title>
|
||||
<section xml:id="arch.bulk.load.overview"><title>Overview</title>
|
||||
<para>
|
||||
HBase includes several methods of loading data into tables.
|
||||
The most straightforward method is to either use the <code>TableOutputFormat</code>
|
||||
class from a MapReduce job, or use the normal client APIs; however,
|
||||
these are not always the most efficient methods.
|
||||
</para>
|
||||
<para>
|
||||
The bulk load feature uses a MapReduce job to output table data in HBase's internal
|
||||
data format, and then directly loads the generated StoreFiles into a running
|
||||
cluster. Bulk loading uses less CPU and network resources than
|
||||
simply using the HBase API.
|
||||
</para>
|
||||
</section>
|
||||
<section xml:id="arch.bulk.load.arch"><title>Bulk Load Architecture</title>
|
||||
<para>
|
||||
The HBase bulk load process consists of two main steps.
|
||||
</para>
|
||||
<section xml:id="arch.bulk.load.prep"><title>Preparing data via a MapReduce job</title>
|
||||
<para>
|
||||
The first step of a bulk load is to generate HBase data files (StoreFiles) from
|
||||
a MapReduce job using <code>HFileOutputFormat</code>. This output format writes
|
||||
out data in HBase's internal storage format so that it can be
|
||||
later loaded very efficiently into the cluster.
|
||||
</para>
|
||||
<para>
|
||||
In order to function efficiently, <code>HFileOutputFormat</code> must be
|
||||
configured such that each output HFile fits within a single region.
|
||||
In order to do this, jobs whose output will be bulk loaded into HBase
|
||||
use Hadoop's <code>TotalOrderPartitioner</code> class to partition the map output
|
||||
into disjoint ranges of the key space, corresponding to the key
|
||||
ranges of the regions in the table.
|
||||
</para>
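<para>
The partitioning idea above can be sketched in plain Java. This is only an illustration, not the code of <code>TotalOrderPartitioner</code>: the class name <code>RegionPartitionSketch</code>, the method <code>regionFor</code>, and the split points are hypothetical, and keys are compared as Java strings rather than as raw bytes the way HBase compares row keys.
</para>

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: assigns a row key to one of several disjoint key ranges.
public class RegionPartitionSketch {

    // splitPoints holds the sorted start keys of every region except the first,
    // so n split points describe n + 1 regions. Returns the index of the region
    // whose range contains rowKey (a start key belongs to the region it starts).
    public static int regionFor(String rowKey, List<String> splitPoints) {
        int lo = 0;
        int hi = splitPoints.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (splitPoints.get(mid).compareTo(rowKey) <= 0) {
                lo = mid + 1;   // rowKey sorts at or after this split point
            } else {
                hi = mid;       // rowKey sorts before this split point
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        // Assumed region boundaries, for illustration only.
        List<String> splits = Arrays.asList("g", "n", "t");
        for (String key : new String[] {"apple", "g", "melon", "zebra"}) {
            System.out.println(key + " -> region " + regionFor(key, splits));
        }
    }
}
```

<para>
Because every key maps to exactly one range, each reducer in such a scheme would see only keys for a single region, which is what lets each output HFile fit within one region.
</para>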
|
||||
<para>
|
||||
<code>HFileOutputFormat</code> includes a convenience function,
|
||||
<code>configureIncrementalLoad()</code>, which automatically sets up
|
||||
a <code>TotalOrderPartitioner</code> based on the current region boundaries of a
|
||||
table.
|
||||
</para>
|
||||
</section>
|
||||
<section xml:id="arch.bulk.load.complete"><title>Completing the data load</title>
|
||||
<para>
|
||||
After the data has been prepared using
|
||||
<code>HFileOutputFormat</code>, it is loaded into the cluster using
|
||||
<code>completebulkload</code>. This command line tool iterates
|
||||
through the prepared data files, and for each one determines the
|
||||
region the file belongs to. It then contacts the appropriate Region
|
||||
Server which adopts the HFile, moving it into its storage directory
|
||||
and making the data available to clients.
|
||||
</para>
|
||||
<para>
|
||||
If the region boundaries have changed during the course of bulk load
|
||||
preparation, or between the preparation and completion steps, the
|
||||
<code>completebulkload</code> utility will automatically split the
|
||||
data files into pieces corresponding to the new boundaries. This
|
||||
process is not optimally efficient, so users should take care to
|
||||
minimize the delay between preparing a bulk load and importing it
|
||||
into the cluster, especially if other clients are simultaneously
|
||||
loading data through other means.
|
||||
</para>
|
||||
</section>
|
||||
</section>
|
||||
<section xml:id="arch.bulk.load.import"><title>Importing the prepared data using the completebulkload tool</title>
|
||||
<para>
|
||||
After a data import has been prepared, either by using the
|
||||
<code>importtsv</code> tool with the
|
||||
"<code>importtsv.bulk.output</code>" option or by some other MapReduce
|
||||
job using the <code>HFileOutputFormat</code>, the
|
||||
<code>completebulkload</code> tool is used to import the data into the
|
||||
running cluster.
|
||||
</para>
|
||||
<para>
|
||||
The <code>completebulkload</code> tool simply takes the output path
|
||||
where <code>importtsv</code> or your MapReduce job put its results, and
|
||||
the table name to import into. For example:
|
||||
</para>
|
||||
<code>$ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable</code>
|
||||
<para>
|
||||
The <code>-c config-file</code> option can be used to specify a file
|
||||
containing the appropriate hbase parameters (e.g., hbase-site.xml) if
|
||||
not supplied already on the CLASSPATH (In addition, the CLASSPATH must
|
||||
contain the directory that has the zookeeper configuration file if
|
||||
zookeeper is NOT managed by HBase).
|
||||
</para>
|
||||
<para>
|
||||
Note: If the target table does not already exist in HBase, this
|
||||
tool will create the table automatically.</para>
|
||||
<para>
|
||||
This tool will run quickly, after which point the new data will be visible in
|
||||
the cluster.
|
||||
</para>
|
||||
</section>
|
||||
<section xml:id="arch.bulk.load.also"><title>See Also</title>
|
||||
<para>For more information about the referenced utilities, see <xref linkend="importtsv"/> and <xref linkend="completebulkload"/>.
|
||||
</para>
|
||||
</section>
|
||||
<section xml:id="arch.bulk.load.adv"><title>Advanced Usage</title>
|
||||
<para>
|
||||
Although the <code>importtsv</code> tool is useful in many cases, advanced users may
|
||||
want to generate data programmatically, or import data from other formats. To get
|
||||
started doing so, dig into <code>ImportTsv.java</code> and check the JavaDoc for
|
||||
HFileOutputFormat.
|
||||
</para>
|
||||
<para>
|
||||
The import step of the bulk load can also be done programmatically. See the
|
||||
<code>LoadIncrementalHFiles</code> class for more information.
|
||||
</para>
|
||||
</section>
|
||||
</section> <!-- bulk loading -->
|
||||
|
||||
<section xml:id="arch.hdfs"><title>HDFS</title>
|
||||
<para>As HBase runs on HDFS (and each StoreFile is written as a file on HDFS),
|
||||
|
|
|
@ -36,6 +36,25 @@
|
|||
|
||||
<para>Here we list HBase tools for administration, analysis, fixup, and
|
||||
debugging.</para>
|
||||
<section xml:id="driver"><title>Driver</title>
|
||||
<para>The HBase jar executes a <code>Driver</code> class that can be used to invoke frequently accessed utilities. For example,
|
||||
<programlisting>HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar
|
||||
</programlisting>
|
||||
... will return...
|
||||
<programlisting>
|
||||
An example program must be given as the first argument.
|
||||
Valid program names are:
|
||||
completebulkload: Complete a bulk data load.
|
||||
copytable: Export a table from local cluster to peer cluster
|
||||
export: Write table data to HDFS.
|
||||
import: Import data written by Export.
|
||||
importtsv: Import data in TSV format.
|
||||
rowcounter: Count rows in HBase table
|
||||
verifyrep: Compare the data from tables in two different clusters. WARNING: It doesn't work for incrementColumnValues'd cells since the timestamp is chan
|
||||
</programlisting>
|
||||
... for allowable program names.
|
||||
</para>
|
||||
</section>
|
||||
<section xml:id="hbck">
|
||||
<title>HBase <application>hbck</application></title>
|
||||
<subtitle>An <emphasis>fsck</emphasis> for your HBase install</subtitle>
|
||||
|
@ -133,15 +152,92 @@
|
|||
</section>
|
||||
<section xml:id="importtsv">
|
||||
<title>ImportTsv</title>
|
||||
<para>Import is a utility that will load data in TSV format into HBase. Invoke via:
|
||||
<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
|
||||
<para>ImportTsv is a utility that will load data in TSV format into HBase. It has two distinct usages: loading data from TSV format in HDFS
|
||||
into HBase via Puts, and preparing StoreFiles to be loaded via the <code>completebulkload</code> utility.
|
||||
</para>
|
||||
<para>To load data via Puts (i.e., non-bulk loading):
|
||||
<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <hdfs-inputdir>
|
||||
</programlisting>
|
||||
</para>
|
||||
<para>To generate StoreFiles for bulk-loading:
|
||||
<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c -Dimporttsv.bulk.output=hdfs://storefile-outputdir <tablename> <hdfs-data-inputdir>
|
||||
</programlisting>
|
||||
</para>
|
||||
<para>These generated StoreFiles can be loaded into HBase via <xref linkend="completebulkload"/>.
|
||||
</para>
|
||||
<section xml:id="importtsv.options"><title>ImportTsv Options</title>
|
||||
Running ImportTsv with no arguments prints brief usage information:
|
||||
<programlisting>
|
||||
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
|
||||
|
||||
Imports the given input directory of TSV data into the specified table.
|
||||
|
||||
The column names of the TSV data must be specified using the -Dimporttsv.columns
|
||||
option. This option takes the form of comma-separated column names, where each
|
||||
column name is either a simple column family, or a columnfamily:qualifier. The special
|
||||
column name HBASE_ROW_KEY is used to designate that this column should be used
|
||||
as the row key for each imported record. You must specify exactly one column
|
||||
to be the row key, and you must specify a column name for every column that exists in the
|
||||
input data.
|
||||
|
||||
By default importtsv will load data directly into HBase. To instead generate
|
||||
HFiles of data to prepare for a bulk data load, pass the option:
|
||||
-Dimporttsv.bulk.output=/path/for/output
|
||||
Note: if you do not use this option, then the target table must already exist in HBase
|
||||
|
||||
Other options that may be specified with -D include:
|
||||
-Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
|
||||
'-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
|
||||
-Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
|
||||
-Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
|
||||
</programlisting>
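<para>
The rules for <code>-Dimporttsv.columns</code> in the usage text can be sketched as follows. This is a minimal illustration under stated assumptions, not the parser that ImportTsv uses: the class <code>ColumnSpecSketch</code> and both of its methods are hypothetical names.
</para>

```java
// Hypothetical sketch of the column-spec rules described in the usage text.
public class ColumnSpecSketch {

    static final String ROW_KEY = "HBASE_ROW_KEY";

    // Returns the position of HBASE_ROW_KEY in a comma-separated column spec,
    // rejecting specs that name the row key zero times or more than once.
    public static int rowKeyIndex(String spec) {
        String[] columns = spec.split(",");
        int index = -1;
        for (int i = 0; i < columns.length; i++) {
            if (ROW_KEY.equals(columns[i])) {
                if (index != -1) {
                    throw new IllegalArgumentException("more than one " + ROW_KEY + " column");
                }
                index = i;
            }
        }
        if (index == -1) {
            throw new IllegalArgumentException("no " + ROW_KEY + " column");
        }
        return index;
    }

    // Splits "columnfamily:qualifier" into {family, qualifier};
    // a bare column family yields an empty qualifier.
    public static String[] familyAndQualifier(String column) {
        int colon = column.indexOf(':');
        if (colon < 0) {
            return new String[] {column, ""};
        }
        return new String[] {column.substring(0, colon), column.substring(colon + 1)};
    }

    public static void main(String[] args) {
        String spec = "HBASE_ROW_KEY,d:c1,d:c2";
        System.out.println("row key is column " + rowKeyIndex(spec));
        String[] parts = familyAndQualifier("d:c1");
        System.out.println("family=" + parts[0] + " qualifier=" + parts[1]);
    }
}
```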
|
||||
</section>
|
||||
<section xml:id="bulk.loading">
|
||||
<title>Bulk Loading</title>
|
||||
<para>For information about bulk-loading HFiles into HBase, see <link xlink:href="http://hbase.apache.org/bulk-loads.html">Bulk Loads</link>.
|
||||
This page currently exists on the website and will eventually be migrated into the RefGuide.
|
||||
<section xml:id="importtsv.example"><title>ImportTsv Example</title>
|
||||
<para>For example, assume that we are loading data into a table called 'datatsv' with a ColumnFamily called 'd' with two columns "c1" and "c2".
|
||||
</para>
|
||||
<para>Assume that an input file exists as follows:
|
||||
<programlisting>
|
||||
row1 c1 c2
|
||||
row2 c1 c2
|
||||
row3 c1 c2
|
||||
row4 c1 c2
|
||||
row5 c1 c2
|
||||
row6 c1 c2
|
||||
row7 c1 c2
|
||||
row8 c1 c2
|
||||
row9 c1 c2
|
||||
row10 c1 c2
|
||||
</programlisting>
|
||||
</para>
|
||||
<para>For ImportTsv to use this input file, the command line needs to look like this:
|
||||
<programlisting>
|
||||
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile
|
||||
</programlisting>
|
||||
... and in this example the first column is the rowkey, which is why the HBASE_ROW_KEY is used. The second and third columns in the file will be imported as "d:c1" and "d:c2", respectively.
|
||||
</para>
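<para>
To make the mapping in this example concrete, the sketch below pairs each field of one input line with the column named at the same position in <code>-Dimporttsv.columns</code>. It is an illustration only: <code>TsvLineSketch</code> and <code>toCells</code> are hypothetical names, and it assumes the fields of the input file are separated by tabs (the default separator).
</para>

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: pairs the fields of one TSV line with the column spec.
public class TsvLineSketch {

    // Returns the fields of a tab-separated line keyed by column name,
    // in column order. Every field must have a matching column name.
    public static Map<String, String> toCells(String line, String columnSpec) {
        String[] columns = columnSpec.split(",");
        String[] fields = line.split("\t", -1); // -1 keeps trailing empty fields
        if (fields.length != columns.length) {
            throw new IllegalArgumentException(
                "expected " + columns.length + " fields but found " + fields.length);
        }
        Map<String, String> cells = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            cells.put(columns[i], fields[i]);
        }
        return cells;
    }

    public static void main(String[] args) {
        Map<String, String> cells = toCells("row1\tc1\tc2", "HBASE_ROW_KEY,d:c1,d:c2");
        // prints {HBASE_ROW_KEY=row1, d:c1=c1, d:c2=c2}
        System.out.println(cells);
    }
}
```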
|
||||
</section>
|
||||
<section xml:id="importtsv.warning"><title>ImportTsv Warning</title>
|
||||
<para>If you are preparing a lot of data for bulk loading, make sure the target HBase table is pre-split appropriately.
|
||||
</para>
|
||||
</section>
|
||||
<section xml:id="importtsv.also"><title>See Also</title>
|
||||
For more information about bulk-loading HFiles into HBase, see <xref linkend="arch.bulk.load"/>.
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section xml:id="completebulkload">
|
||||
<title>CompleteBulkLoad</title>
|
||||
<para>The <code>completebulkload</code> utility will move generated StoreFiles into an HBase table. This utility is often used
|
||||
in conjunction with output from <xref linkend="importtsv"/>.
|
||||
</para>
|
||||
<para>There are two ways to invoke this utility: with an explicit classname, and via the driver. With an explicit classname:
|
||||
<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <hdfs://storefileoutput> <tablename>
|
||||
</programlisting>
|
||||
... and via the Driver ...
|
||||
<programlisting>HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar completebulkload <hdfs://storefileoutput> <tablename>
|
||||
</programlisting>
|
||||
</para>
|
||||
<para>For more information about bulk-loading HFiles into HBase, see <xref linkend="arch.bulk.load"/>.
|
||||
</para>
|
||||
</section>
|
||||
<section xml:id="walplayer">
|
||||
|
|
|
@ -23,149 +23,9 @@
|
|||
</title>
|
||||
</properties>
|
||||
<body>
|
||||
<section name="Overview">
|
||||
<p>
|
||||
HBase includes several methods of loading data into tables.
|
||||
The most straightforward method is to either use the TableOutputFormat
|
||||
class from a MapReduce job, or use the normal client APIs; however,
|
||||
these are not always the most efficient methods.
|
||||
<p>This page has been retired. The contents have been moved to the
|
||||
<a href="http://hbase.apache.org/book.html#arch.bulk.load">Bulk Loading</a> section
|
||||
in the Reference Guide.
|
||||
</p>
|
||||
<p>
|
||||
This document describes HBase's bulk load functionality. The bulk load
|
||||
feature uses a MapReduce job to output table data in HBase's internal
|
||||
data format, and then directly loads the data files into a running
|
||||
cluster. Using bulk load will use less CPU and network resources than
|
||||
simply using the HBase API.
|
||||
</p>
|
||||
</section>
|
||||
<section name="Bulk Load Architecture">
|
||||
<p>
|
||||
The HBase bulk load process consists of two main steps.
|
||||
</p>
|
||||
<section name="Preparing data via a MapReduce job">
|
||||
<p>
|
||||
The first step of a bulk load is to generate HBase data files from
|
||||
a MapReduce job using HFileOutputFormat. This output format writes
|
||||
out data in HBase's internal storage format so that they can be
|
||||
later loaded very efficiently into the cluster.
|
||||
</p>
|
||||
<p>
|
||||
In order to function efficiently, HFileOutputFormat must be
|
||||
configured such that each output HFile fits within a single region.
|
||||
In order to do this, jobs whose output will be bulk loaded into HBase
|
||||
use Hadoop's TotalOrderPartitioner class to partition the map output
|
||||
into disjoint ranges of the key space, corresponding to the key
|
||||
ranges of the regions in the table.
|
||||
</p>
|
||||
<p>
|
||||
HFileOutputFormat includes a convenience function,
|
||||
<code>configureIncrementalLoad()</code>, which automatically sets up
|
||||
a TotalOrderPartitioner based on the current region boundaries of a
|
||||
table.
|
||||
</p>
|
||||
</section>
|
||||
<section name="Completing the data load">
|
||||
<p>
|
||||
After the data has been prepared using
|
||||
<code>HFileOutputFormat</code>, it is loaded into the cluster using
|
||||
<code>completebulkload</code>. This command line tool iterates
|
||||
through the prepared data files, and for each one determines the
|
||||
region the file belongs to. It then contacts the appropriate Region
|
||||
Server which adopts the HFile, moving it into its storage directory
|
||||
and making the data available to clients.
|
||||
</p>
|
||||
<p>
|
||||
If the region boundaries have changed during the course of bulk load
|
||||
preparation, or between the preparation and completion steps, the
|
||||
<code>completebulkload</code> utility will automatically split the
|
||||
data files into pieces corresponding to the new boundaries. This
|
||||
process is not optimally efficient, so users should take care to
|
||||
minimize the delay between preparing a bulk load and importing it
|
||||
into the cluster, especially if other clients are simultaneously
|
||||
loading data through other means.
|
||||
</p>
|
||||
</section>
|
||||
</section>
|
||||
<section name="Importing the prepared data using the completebulkload tool">
|
||||
<p>
|
||||
After a data import has been prepared, either by using the
|
||||
<code>importtsv</code> tool with the
|
||||
"<code>importtsv.bulk.output</code>" option or by some other MapReduce
|
||||
job using the <code>HFileOutputFormat</code>, the
|
||||
<code>completebulkload</code> tool is used to import the data into the
|
||||
running cluster.
|
||||
</p>
|
||||
<p>
|
||||
The <code>completebulkload</code> tool simply takes the output path
|
||||
where <code>importtsv</code> or your MapReduce job put its results, and
|
||||
the table name to import into. For example:
|
||||
</p>
|
||||
<code>$ hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable</code>
|
||||
<p>
|
||||
The <code>-c config-file</code> option can be used to specify a file
|
||||
containing the appropriate hbase parameters (e.g., hbase-site.xml) if
|
||||
not supplied already on the CLASSPATH (In addition, the CLASSPATH must
|
||||
contain the directory that has the zookeeper configuration file if
|
||||
zookeeper is NOT managed by HBase).
|
||||
</p>
|
||||
<p>
|
||||
<b>Note:</b> If the target table does not already exist in HBase, this
|
||||
tool will create the table automatically.</p>
|
||||
<p>
|
||||
This tool will run quickly, after which point the new data will be visible in
|
||||
the cluster.
|
||||
</p>
|
||||
</section>
|
||||
<section name="Using the importtsv tool to bulk load data">
|
||||
<p>
|
||||
HBase ships with a command line tool called <code>importtsv</code>
|
||||
which when given files containing data in TSV form can prepare this
|
||||
data for bulk import into HBase. This tool by default uses the HBase
|
||||
<code>put</code> API to insert data into HBase one row at a time, but
|
||||
when the "<code>importtsv.bulk.output</code>" option is used,
|
||||
<code>importtsv</code> will instead generate files using
|
||||
<code>HFileOutputFormat</code> which can subsequently be bulk-loaded
|
||||
into HBase using the <code>completebulkload</code> tool described
|
||||
above. This tool is available by running "<code>hadoop jar
|
||||
/path/to/hbase-VERSION.jar importtsv</code>". Running this command
|
||||
with no arguments prints brief usage information:
|
||||
</p>
|
||||
<code><pre>
|
||||
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
|
||||
|
||||
Imports the given input directory of TSV data into the specified table.
|
||||
|
||||
The column names of the TSV data must be specified using the -Dimporttsv.columns
|
||||
option. This option takes the form of comma-separated column names, where each
|
||||
column name is either a simple column family, or a columnfamily:qualifier. The special
|
||||
column name HBASE_ROW_KEY is used to designate that this column should be used
|
||||
as the row key for each imported record. You must specify exactly one column
|
||||
to be the row key, and you must specify a column name for every column that exists in the
|
||||
input data.
|
||||
|
||||
By default importtsv will load data directly into HBase. To instead generate
|
||||
HFiles of data to prepare for a bulk data load, pass the option:
|
||||
-Dimporttsv.bulk.output=/path/for/output
|
||||
Note: if you do not use this option, then the target table must already exist in HBase
|
||||
|
||||
Other options that may be specified with -D include:
|
||||
-Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
|
||||
'-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
|
||||
-Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
|
||||
-Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
|
||||
</pre></code>
|
||||
</section>
|
||||
<section name="Advanced Usage">
|
||||
<p>
|
||||
Although the <code>importtsv</code> tool is useful in many cases, advanced users may
|
||||
want to generate data programmatically, or import data from other formats. To get
|
||||
started doing so, dig into <code>ImportTsv.java</code> and check the JavaDoc for
|
||||
HFileOutputFormat.
|
||||
</p>
|
||||
<p>
|
||||
The import step of the bulk load can also be done programmatically. See the
|
||||
<code>LoadIncrementalHFiles</code> class for more information.
|
||||
</p>
|
||||
</section>
|
||||
</body>
|
||||
</document>
|
||||
|
|