HBASE-8807 HBase MapReduce Job-Launch Documentation Misplaced (Misty Stanley-Jones)

Michael Stack 2014-05-27 15:13:23 -07:00
parent d9739b9e3f
commit ef995efb1a
3 changed files with 336 additions and 407 deletions


@ -20,104 +20,7 @@
Provides HBase <a href="http://wiki.apache.org/hadoop/HadoopMapReduce">MapReduce</a> Provides HBase <a href="http://wiki.apache.org/hadoop/HadoopMapReduce">MapReduce</a>
Input/OutputFormats, a table indexing MapReduce job, and utility Input/OutputFormats, a table indexing MapReduce job, and utility
<h2>Table of Contents</h2> <p>See <a href="http://hbase.apache.org/book.html#mapreduce">HBase and MapReduce</a>
<ul> in the HBase Reference Guide for mapreduce over hbase documentation.
<li><a href="#classpath">HBase, MapReduce and the CLASSPATH</a></li>
<li><a href="#sink">HBase as MapReduce job data source and sink</a></li>
<li><a href="#examples">Example Code</a></li>
</ul>
<h2><a name="classpath">HBase, MapReduce and the CLASSPATH</a></h2>
<p>MapReduce jobs deployed to a MapReduce cluster do not by default have access
to the HBase configuration under <code>$HBASE_CONF_DIR</code> nor to HBase classes.
You could add <code>hbase-site.xml</code> to $HADOOP_HOME/conf and add
<code>hbase-X.X.X.jar</code> to the <code>$HADOOP_HOME/lib</code> and copy these
changes across your cluster but the cleanest means of adding hbase configuration
and classes to the cluster <code>CLASSPATH</code> is by uncommenting
<code>HADOOP_CLASSPATH</code> in <code>$HADOOP_HOME/conf/hadoop-env.sh</code>
adding hbase dependencies here. For example, here is how you would amend
<code>hadoop-env.sh</code> adding the
built hbase jar, zookeeper (needed by hbase client), hbase conf, and the
<code>PerformanceEvaluation</code> class from the built hbase test jar to the
hadoop <code>CLASSPATH</code>:
<blockquote><pre># Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH=
export HADOOP_CLASSPATH=$HBASE_HOME/build/hbase-X.X.X.jar:$HBASE_HOME/build/hbase-X.X.X-test.jar:$HBASE_HOME/conf:${HBASE_HOME}/lib/zookeeper-X.X.X.jar</pre></blockquote>
<p>Expand <code>$HBASE_HOME</code> in the above appropriately to suit your
local environment.</p>
<p>After copying the above change around your cluster (and restarting), this is
how you would run the PerformanceEvaluation MR job to put up 4 clients (Presumes
a ready mapreduce cluster):
<blockquote><pre>$HADOOP_HOME/bin/hadoop org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 4</pre></blockquote>
The PerformanceEvaluation class will be found on the CLASSPATH because you
added <code>$HBASE_HOME/build/test</code> to HADOOP_CLASSPATH
</p>
<p>Another possibility, if for example you do not have access to hadoop-env.sh or
are unable to restart the hadoop cluster, is bundling the hbase jar into a mapreduce
job jar adding it and its dependencies under the job jar <code>lib/</code>
directory and the hbase conf into a job jar <code>conf/</code> directory.
</a>
<h2><a name="sink">HBase as MapReduce job data source and sink</a></h2>
<p>HBase can be used as a data source, {@link org.apache.hadoop.hbase.mapred.TableInputFormat TableInputFormat},
and data sink, {@link org.apache.hadoop.hbase.mapred.TableOutputFormat TableOutputFormat}, for MapReduce jobs.
Writing MapReduce jobs that read or write HBase, you'll probably want to subclass
{@link org.apache.hadoop.hbase.mapred.TableMap TableMap} and/or
{@link org.apache.hadoop.hbase.mapred.TableReduce TableReduce}. See the do-nothing
pass-through classes {@link org.apache.hadoop.hbase.mapred.IdentityTableMap IdentityTableMap} and
{@link org.apache.hadoop.hbase.mapred.IdentityTableReduce IdentityTableReduce} for basic usage. For a more
involved example, see <code>BuildTableIndex</code>
or review the <code>org.apache.hadoop.hbase.mapred.TestTableMapReduce</code> unit test.
</p>
<p>Running mapreduce jobs that have hbase as source or sink, you'll need to
specify source/sink table and column names in your configuration.</p>
<p>Reading from hbase, the TableInputFormat asks hbase for the list of
regions and makes a map-per-region or <code>mapred.map.tasks maps</code>,
whichever is smaller (If your job only has two maps, up mapred.map.tasks
to a number > number of regions). Maps will run on the adjacent TaskTracker
if you are running a TaskTracker and RegionServer per node.
Writing, it may make sense to avoid the reduce step and write yourself back into
hbase from inside your map. You'd do this when your job does not need the sort
and collation that mapreduce does on the map emitted data; on insert,
hbase 'sorts' so there is no point double-sorting (and shuffling data around
your mapreduce cluster) unless you need to. If you do not need the reduce,
you might just have your map emit counts of records processed just so the
framework's report at the end of your job has meaning or set the number of
reduces to zero and use TableOutputFormat. See example code
below. If running the reduce step makes sense in your case, it's usually better
to have lots of reducers so load is spread across the hbase cluster.</p>
<p>There is also a new hbase partitioner that will run as many reducers as
currently existing regions. The
{@link org.apache.hadoop.hbase.mapred.HRegionPartitioner} is suitable
when your table is large and your upload is not such that it will greatly
alter the number of existing regions when done; otherwise use the default
partitioner.
</p>
<h2><a name="examples">Example Code</a></h2>
<h3>Sample Row Counter</h3>
<p>See {@link org.apache.hadoop.hbase.mapred.RowCounter}. You should be able to run
it by doing: <code>% ./bin/hadoop jar hbase-X.X.X.jar</code>. This will invoke
the hbase MapReduce Driver class. Select 'rowcounter' from the choice of jobs
offered. You may need to add the hbase conf directory to <code>$HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH</code>
so the rowcounter gets pointed at the right hbase cluster (or, build a new jar
with an appropriate hbase-site.xml built into your job jar).
</p>
<h3>PerformanceEvaluation</h3>
<p>See org.apache.hadoop.hbase.PerformanceEvaluation from hbase src/test. It runs
a mapreduce job to run concurrent clients reading and writing hbase.
</p>
*/ */
package org.apache.hadoop.hbase.mapred; package org.apache.hadoop.hbase.mapred;


@ -20,144 +20,7 @@
Provides HBase <a href="http://wiki.apache.org/hadoop/HadoopMapReduce">MapReduce</a> Provides HBase <a href="http://wiki.apache.org/hadoop/HadoopMapReduce">MapReduce</a>
Input/OutputFormats, a table indexing MapReduce job, and utility Input/OutputFormats, a table indexing MapReduce job, and utility
<h2>Table of Contents</h2> <p>See <a href="http://hbase.apache.org/book.html#mapreduce">HBase and MapReduce</a>
<ul> in the HBase Reference Guide for mapreduce over hbase documentation.
<li><a href="#classpath">HBase, MapReduce and the CLASSPATH</a></li>
<li><a href="#driver">Bundled HBase MapReduce Jobs</a></li>
<li><a href="#sink">HBase as MapReduce job data source and sink</a></li>
<li><a href="#bulk">Bulk Import writing HFiles directly</a></li>
<li><a href="#examples">Example Code</a></li>
</ul>
<h2><a name="classpath">HBase, MapReduce and the CLASSPATH</a></h2>
<p>MapReduce jobs deployed to a MapReduce cluster do not by default have access
to the HBase configuration under <code>$HBASE_CONF_DIR</code> nor to HBase classes.
You could add <code>hbase-site.xml</code> to
<code>$HADOOP_HOME/conf</code> and add
HBase jars to the <code>$HADOOP_HOME/lib</code> and copy these
changes across your cluster (or edit conf/hadoop-env.sh and add them to the
<code>HADOOP_CLASSPATH</code> variable) but this will pollute your
hadoop install with HBase references; it's also obnoxious requiring restart of
the hadoop cluster before it'll notice your HBase additions.</p>
<p>As of 0.90.x, HBase will just add its dependency jars to the job
configuration; the dependencies just need to be available on the local
<code>CLASSPATH</code>. For example, to run the bundled HBase
{@link org.apache.hadoop.hbase.mapreduce.RowCounter} mapreduce job against a table named <code>usertable</code>,
type:
<blockquote><pre>
$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0.jar rowcounter usertable
</pre></blockquote>
Expand <code>$HBASE_HOME</code> and <code>$HADOOP_HOME</code> in the above
appropriately to suit your local environment. The content of <code>HADOOP_CLASSPATH</code>
is set to the HBase <code>CLASSPATH</code> via backticking the command
<code>${HBASE_HOME}/bin/hbase classpath</code>.
<p>When the above runs, internally, the HBase jar finds its zookeeper and
<a href="http://code.google.com/p/guava-libraries/">guava</a>,
etc., dependencies on the passed
<code>HADOOP_CLASSPATH</code> and adds the found jars to the mapreduce
job configuration. See the source at
<code>TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job)</code>
for how this is done.
</p>
<p>The above may not work if you are running your HBase from its build directory;
i.e. you've done <code>$ mvn test install</code> at
<code>${HBASE_HOME}</code> and you are now
trying to use this build in your mapreduce job. If you get
<blockquote><pre>java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper
...
</pre></blockquote>
exception thrown, try doing the following:
<blockquote><pre>
$ HADOOP_CLASSPATH=${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar rowcounter usertable
</pre></blockquote>
Notice how we preface the backtick invocation setting
<code>HADOOP_CLASSPATH</code> with reference to the built HBase jar over in
the <code>target</code> directory.
</p>
<h2><a name="driver">Bundled HBase MapReduce Jobs</a></h2>
<p>The HBase jar also serves as a Driver for some bundled mapreduce jobs. To
learn about the bundled mapreduce jobs run:
<blockquote><pre>
$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar
An example program must be given as the first argument.
Valid program names are:
copytable: Export a table from local cluster to peer cluster
completebulkload: Complete a bulk data load.
export: Write table data to HDFS.
import: Import data written by Export.
importtsv: Import data in TSV format.
rowcounter: Count rows in HBase table
</pre></blockquote>
<h2><a name="sink">HBase as MapReduce job data source and sink</a></h2>
<p>HBase can be used as a data source, {@link org.apache.hadoop.hbase.mapreduce.TableInputFormat TableInputFormat},
and data sink, {@link org.apache.hadoop.hbase.mapreduce.TableOutputFormat TableOutputFormat}
or {@link org.apache.hadoop.hbase.mapreduce.MultiTableOutputFormat MultiTableOutputFormat},
for MapReduce jobs.
Writing MapReduce jobs that read or write HBase, you'll probably want to subclass
{@link org.apache.hadoop.hbase.mapreduce.TableMapper TableMapper} and/or
{@link org.apache.hadoop.hbase.mapreduce.TableReducer TableReducer}. See the do-nothing
pass-through classes {@link org.apache.hadoop.hbase.mapreduce.IdentityTableMapper IdentityTableMapper} and
{@link org.apache.hadoop.hbase.mapreduce.IdentityTableReducer IdentityTableReducer} for basic usage. For a more
involved example, see {@link org.apache.hadoop.hbase.mapreduce.RowCounter}
or review the <code>org.apache.hadoop.hbase.mapreduce.TestTableMapReduce</code> unit test.
</p>
<p>Running mapreduce jobs that have HBase as source or sink, you'll need to
specify source/sink table and column names in your configuration.</p>
<p>Reading from HBase, the TableInputFormat asks HBase for the list of
regions and makes a map-per-region or <code>mapreduce.job.maps maps</code>,
whichever is smaller (If your job only has two maps, up mapreduce.job.maps
to a number &gt; number of regions). Maps will run on the adjacent TaskTracker
if you are running a TaskTracker and RegionServer per node.
Writing, it may make sense to avoid the reduce step and write yourself back into
HBase from inside your map. You'd do this when your job does not need the sort
and collation that mapreduce does on the map emitted data; on insert,
HBase 'sorts' so there is no point double-sorting (and shuffling data around
your mapreduce cluster) unless you need to. If you do not need the reduce,
you might just have your map emit counts of records processed just so the
framework's report at the end of your job has meaning or set the number of
reduces to zero and use TableOutputFormat. See example code
below. If running the reduce step makes sense in your case, it's usually better
to have lots of reducers so load is spread across the HBase cluster.</p>
<p>There is also a new HBase partitioner that will run as many reducers as
currently existing regions. The
{@link org.apache.hadoop.hbase.mapreduce.HRegionPartitioner} is suitable
when your table is large and your upload is not such that it will greatly
alter the number of existing regions when done; otherwise use the default
partitioner.
</p>
<h2><a name="bulk">Bulk import writing HFiles directly</a></h2>
<p>If importing into a new table, it's possible to bypass the HBase API
and write your content directly to the filesystem properly formatted as
HBase data files (HFiles). Your import will run faster, perhaps an order of
magnitude faster if not more. For more on how this mechanism works, see
<a href="http://hbase.apache.org/bulk-loads.html">Bulk Loads</a>
documentation.
</p>
<h2><a name="examples">Example Code</a></h2>
<h3>Sample Row Counter</h3>
<p>See {@link org.apache.hadoop.hbase.mapreduce.RowCounter}. This job uses
{@link org.apache.hadoop.hbase.mapreduce.TableInputFormat TableInputFormat} and
does a count of all rows in specified table.
You should be able to run
it by doing: <code>% ./bin/hadoop jar hbase-X.X.X.jar</code>. This will invoke
the hbase MapReduce Driver class. Select 'rowcounter' from the choice of jobs
offered. This will emit rowcounter 'usage'. Specify tablename, column to count
and output directory. You may need to add the hbase conf directory to <code>$HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH</code>
so the rowcounter gets pointed at the right hbase cluster (or, build a new jar
with an appropriate hbase-site.xml built into your job jar).
</p>
*/ */
package org.apache.hadoop.hbase.mapreduce; package org.apache.hadoop.hbase.mapreduce;


@ -664,18 +664,76 @@ htable.put(put);
<!-- schema design --> <!-- schema design -->
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="schema_design.xml"/> <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="schema_design.xml"/>
<chapter xml:id="mapreduce"> <chapter
xml:id="mapreduce">
<title>HBase and MapReduce</title> <title>HBase and MapReduce</title>
<para>See <link xlink:href="http://hbase.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description"> <para>Apache MapReduce is a software framework used to analyze large amounts of data, and is
HBase and MapReduce</link> up in javadocs. the framework used most often with <link
Start there. Below is some additional help.</para> xlink:href="http://hadoop.apache.org/">Apache Hadoop</link>. MapReduce itself is out of the
<para>For more information about MapReduce (i.e., the framework in general), see the Hadoop site (TODO: Need good links here -- scope of this document. A good place to get started with MapReduce is <link
we used to have some but they rotted against apache hadoop).</para> xlink:href="http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html" />. MapReduce version
2 (MR2) is now part of <link
xlink:href="http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/">YARN</link>. </para>
<para> This chapter discusses specific configuration steps you need to take to use MapReduce on
data within HBase. In addition, it discusses other interactions and issues between HBase and
MapReduce jobs.
<note>
<title>mapred and mapreduce</title>
<para>There are two mapreduce packages in HBase as in MapReduce itself: <filename>org.apache.hadoop.hbase.mapred</filename>
and <filename>org.apache.hadoop.hbase.mapreduce</filename>. The former does old-style API and the latter
the new style. The latter has more facility though you can usually find an equivalent in the older
package. Pick the package that goes with your mapreduce deploy. When in doubt or starting over, pick the
<filename>org.apache.hadoop.hbase.mapreduce</filename>. In the notes below, we refer to
o.a.h.h.mapreduce but replace with the o.a.h.h.mapred if that is what you are using.
</para>
</note>
</para>
<section
xml:id="hbase.mapreduce.classpath">
<title>HBase, MapReduce, and the CLASSPATH</title>
<para>By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either
the HBase configuration under <envar>$HBASE_CONF_DIR</envar> or the HBase classes.</para>
<para>To give the MapReduce jobs the access they need, you could add
<filename>hbase-site.xml</filename> to the
<filename><replaceable>$HADOOP_HOME</replaceable>/conf/</filename> directory and add the
HBase JARs to the <filename><replaceable>$HADOOP_HOME</replaceable>/lib/</filename>
directory, then copy these changes across your cluster. Alternatively, you could edit
<filename><replaceable>$HADOOP_HOME</replaceable>/conf/hadoop-env.sh</filename> and add
them to the <envar>HADOOP_CLASSPATH</envar> variable. However, this approach is not
recommended because it will pollute your Hadoop install with HBase references. It also
requires you to restart the Hadoop cluster before Hadoop notices the HBase additions.</para>
<para> Since HBase 0.90.x, HBase adds its dependency JARs to the job configuration itself. The
dependencies only need to be available on the local CLASSPATH. The following example runs
the bundled HBase <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
MapReduce job against a table named <systemitem>usertable</systemitem>. If you have not set
the environment variables expected in the command (the parts prefixed by a
<literal>$</literal> sign and curly braces), you can use the actual system paths instead.
Be sure to use the correct version of the HBase JAR for your system. The backticks
(<literal>`</literal> symbols) cause the shell to execute the sub-commands, setting the
CLASSPATH as part of the command. This example assumes you use a BASH-compatible shell. </para>
<screen>$ <userinput>HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0.jar rowcounter usertable</userinput></screen>
<para>When the command runs, the HBase JAR finds the dependencies it needs, such as
ZooKeeper and Guava, on the passed <envar>HADOOP_CLASSPATH</envar>
and adds the JARs to the MapReduce job configuration. See the source at
TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done. </para>
<note>
<para> The example may not work if you are running HBase from its build directory rather
than an installed location. You may see an error like the following:</para>
<screen>java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper</screen>
<para>If this occurs, try modifying the command as follows, so that it uses the HBase JARs
from the <filename>target/</filename> directory within the build environment.</para>
<screen>$ <userinput>HADOOP_CLASSPATH=${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar rowcounter usertable</userinput></screen>
</note>
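The build-directory workaround above amounts to prepending one jar to the classpath that `hbase classpath` prints. The following is a minimal sketch of that composition, not part of HBase itself: the helper name `prepend_cp` and the `/tmp/hbase/...` paths are hypothetical, and the second variable stands in for the real output of `${HBASE_HOME}/bin/hbase classpath`.

```shell
# Prepend one entry to a colon-separated classpath.
# When the existing classpath is empty, no separator is added,
# so the result never ends in a stray colon.
prepend_cp() {
  if [ -n "$2" ]; then
    printf '%s:%s\n' "$1" "$2"
  else
    printf '%s\n' "$1"
  fi
}

# Hypothetical values: a build jar and a stand-in for the output of
# `${HBASE_HOME}/bin/hbase classpath`.
build_jar="/tmp/hbase/target/hbase-0.90.0-SNAPSHOT.jar"
hbase_cp="/tmp/hbase/conf:/tmp/hbase/lib/zookeeper.jar"

# Print the combined classpath with the build jar first.
prepend_cp "$build_jar" "$hbase_cp"
```

In a real launch, the printed value would be assigned to `HADOOP_CLASSPATH` on the same command line as `hadoop jar`, as in the command shown in the note.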
<caution> <caution>
<title>Notice to Mapreduce users of HBase 0.96.1 and above</title> <title>Notice to Mapreduce users of HBase 0.96.1 and above</title>
<para>Some mapreduce jobs that use HBase fail to launch. The symptom is an <para>Some mapreduce jobs that use HBase fail to launch. The symptom is an exception similar
exception similar to the following: to the following:</para>
<programlisting> <screen>
Exception in thread "main" java.lang.IllegalAccessError: class Exception in thread "main" java.lang.IllegalAccessError: class
com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass
com.google.protobuf.LiteralByteString com.google.protobuf.LiteralByteString
@ -703,62 +761,157 @@ Exception in thread "main" java.lang.IllegalAccessError: class
at at
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100) org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100)
... ...
</programlisting> </screen>
This is because of an optimization introduced in <link <para>This is caused by an optimization introduced in <link
xlink:href="https://issues.apache.org/jira/browse/HBASE-9867">HBASE-9867</link> xlink:href="https://issues.apache.org/jira/browse/HBASE-9867">HBASE-9867</link> that
that inadvertently introduced a classloader dependency. inadvertently introduced a classloader dependency. </para>
</para> <para>This affects both jobs using the <code>-libjars</code> option and "fat jar," those
<para>This affects both jobs using the <code>-libjars</code> option and which package their runtime dependencies in a nested <code>lib</code> folder.</para>
"fat jar," those which package their runtime dependencies in a nested <para>In order to satisfy the new classloader requirements, hbase-protocol.jar must be
<code>lib</code> folder.</para> included in Hadoop's classpath. See <xref
<para>In order to satisfy the new classloader requirements, linkend="hbase.mapreduce.classpath" /> for current recommendations for resolving
hbase-protocol.jar must be included in Hadoop's classpath. This can be classpath errors. The following is included for historical purposes.</para>
resolved system-wide by including a reference to the hbase-protocol.jar in <para>This can be resolved system-wide by including a reference to the hbase-protocol.jar in
hadoop's lib directory, via a symlink or by copying the jar into the new hadoop's lib directory, via a symlink or by copying the jar into the new location.</para>
location.</para> <para>This can also be achieved on a per-job launch basis by including it in the
<para>This can also be achieved on a per-job launch basis by including it <code>HADOOP_CLASSPATH</code> environment variable at job submission time. When
in the <code>HADOOP_CLASSPATH</code> environment variable at job submission launching jobs that package their dependencies, all three of the following job launching
time. When launching jobs that package their dependencies, all three of the commands satisfy this requirement:</para>
following job launching commands satisfy this requirement:</para> <screen>
<programlisting> $ <userinput>HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass</userinput>
$ HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass $ <userinput>HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass</userinput>
$ HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass $ <userinput>HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass</userinput>
$ HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass </screen>
</programlisting> <para>For jars that do not package their dependencies, the following command structure is
<para>For jars that do not package their dependencies, the following command necessary:</para>
structure is necessary:</para> <screen>
<programlisting> $ <userinput>HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',')</userinput> ...
$ HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',') ... </screen>
</programlisting>
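The `-libjars` command above pipes the classpath through `tr ':' ','` because `HADOOP_CLASSPATH` is colon-separated while `-libjars` takes a comma-separated list. The sketch below isolates that conversion so it can be tried without a cluster. The function name `to_libjars` and the sample paths are hypothetical; the sample string only stands in for whatever `hbase mapredcp` prints on a given install.

```shell
# Convert a colon-separated classpath into the comma-separated
# form used as the argument to -libjars.
to_libjars() {
  printf '%s\n' "$1" | tr ':' ','
}

# Hypothetical stand-in for the output of `hbase mapredcp`.
sample_cp="/opt/hbase/lib/hbase-protocol.jar:/opt/hbase/lib/zookeeper.jar"

# Print the converted list.
to_libjars "$sample_cp"
```

Note that this simple translation assumes no jar path contains a colon or comma itself.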
<para>See also <link <para>See also <link
xlink:href="https://issues.apache.org/jira/browse/HBASE-10304">HBASE-10304</link> xlink:href="https://issues.apache.org/jira/browse/HBASE-10304">HBASE-10304</link> for
for further discussion of this issue.</para> further discussion of this issue.</para>
</caution> </caution>
<section xml:id="splitter"> </section>
<section>
<title>Bundled HBase MapReduce Jobs</title>
<para>The HBase JAR also serves as a Driver for some bundled mapreduce jobs. To learn about
the bundled MapReduce jobs, run the following command.</para>
<screen>$ <userinput>${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar</userinput>
<computeroutput>An example program must be given as the first argument.
Valid program names are:
copytable: Export a table from local cluster to peer cluster
completebulkload: Complete a bulk data load.
export: Write table data to HDFS.
import: Import data written by Export.
importtsv: Import data in TSV format.
rowcounter: Count rows in HBase table</computeroutput>
</screen>
<para>Each valid program name is a bundled MapReduce job. To run one of the jobs,
model your command after the following example.</para>
<screen>$ <userinput>${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar rowcounter myTable</userinput></screen>
</section>
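A wrapper script might check a program name against the driver's list before launching. The following is a rough sketch under that assumption: `is_bundled_job` is a hypothetical helper, and the six names are copied from the 0.90.0-SNAPSHOT driver output shown above, so other HBase versions may list different programs.

```shell
# Return success if the argument is one of the program names printed
# by the HBase MapReduce driver in the listing above.
is_bundled_job() {
  case "$1" in
    copytable|completebulkload|export|import|importtsv|rowcounter)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Example check; a real wrapper would go on to call `hadoop jar`
# with the HBase jar and the validated program name.
job="rowcounter"
if is_bundled_job "$job"; then
  echo "bundled job: $job"
else
  echo "unknown program name: $job" >&2
fi
```

Running the HBase jar with no arguments remains the authoritative way to see which names a given build accepts.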
<section>
<title>HBase as a MapReduce Job Data Source and Data Sink</title>
<para>HBase can be used as a data source, <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>,
and data sink, <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>
or <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.html">MultiTableOutputFormat</link>,
for MapReduce jobs. Writing MapReduce jobs that read or write HBase, it is advisable to
subclass <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>
and/or <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableReducer.html">TableReducer</link>.
See the do-nothing pass-through classes <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableMapper.html">IdentityTableMapper</link>
and <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableReducer.html">IdentityTableReducer</link>
for basic usage. For a more involved example, see <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
or review the <code>org.apache.hadoop.hbase.mapreduce.TestTableMapReduce</code> unit test. </para>
<para>If you run MapReduce jobs that use HBase as source or sink, you need to specify source and
sink table and column names in your configuration.</para>
<para>When you read from HBase, the <code>TableInputFormat</code> requests the list of regions
from HBase and creates either one map task per region or
<code>mapreduce.job.maps</code> map tasks, whichever is smaller. If your job only has two maps,
raise <code>mapreduce.job.maps</code> to a number greater than the number of regions. Maps
will run on the adjacent TaskTracker if you are running a TaskTracker and RegionServer per
node. When writing to HBase, it may make sense to avoid the Reduce step and write back into
HBase from within your map. This approach works when your job does not need the sort and
collation that MapReduce does on the map-emitted data. On insert, HBase 'sorts' so there is
no point double-sorting (and shuffling data around your MapReduce cluster) unless you need
to. If you do not need the Reduce, your map might emit counts of records processed for
reporting at the end of the job, or you can set the number of Reduces to zero and use
TableOutputFormat. If running the Reduce step makes sense in your case, you should typically
use multiple reducers so that load is spread across the HBase cluster.</para>
<para>A new HBase partitioner, the <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/HRegionPartitioner.html">HRegionPartitioner</link>,
can run as many reducers as the number of existing regions. The HRegionPartitioner is suitable
when your table is large and your upload will not greatly alter the number of existing
regions upon completion. Otherwise use the default partitioner. </para>
</section>
<section>
<title>Writing HFiles Directly During Bulk Import</title>
<para>If you are importing into a new table, you can bypass the HBase API and write your
content directly to the filesystem, formatted into HBase data files (HFiles). Your import
will run faster, perhaps an order of magnitude faster. For more on how this mechanism works,
see <xref
linkend="arch.bulk.load" />.</para>
</section>
<section>
<title>RowCounter Example</title>
<para>The included <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
MapReduce job uses <code>TableInputFormat</code> and does a count of all rows in the specified
table. To run it, use the following command: </para>
<screen>$ <userinput>./bin/hadoop jar hbase-X.X.X.jar</userinput></screen>
<para>This will
invoke the HBase MapReduce Driver class. Select <literal>rowcounter</literal> from the choice of jobs
offered. This will print rowcounter usage advice to standard output. Specify the tablename,
column to count, and output
directory. If you have classpath errors, see <xref linkend="hbase.mapreduce.classpath" />.</para>
</section>
<section
xml:id="splitter">
<title>Map-Task Splitting</title> <title>Map-Task Splitting</title>
<section xml:id="splitter.default"> <section
xml:id="splitter.default">
<title>The Default HBase MapReduce Splitter</title> <title>The Default HBase MapReduce Splitter</title>
<para>When <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link> <para>When <link
is used to source an HBase table in a MapReduce job, xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>
its splitter will make a map task for each region of the table. is used to source an HBase table in a MapReduce job, its splitter will make a map task for
Thus, if there are 100 regions in the table, there will be each region of the table. Thus, if there are 100 regions in the table, there will be 100
100 map-tasks for the job - regardless of how many column families are selected in the Scan.</para> map-tasks for the job - regardless of how many column families are selected in the
Scan.</para>
</section> </section>
<section xml:id="splitter.custom"> <section
xml:id="splitter.custom">
<title>Custom Splitters</title> <title>Custom Splitters</title>
<para>For those interested in implementing custom splitters, see the method <code>getSplits</code> in <para>For those interested in implementing custom splitters, see the method
<link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.html">TableInputFormatBase</link>. <code>getSplits</code> in <link
That is where the logic for map-task assignment resides. xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.html">TableInputFormatBase</link>.
</para> That is where the logic for map-task assignment resides. </para>
</section> </section>
</section> </section>
<section xml:id="mapreduce.example"> <section
xml:id="mapreduce.example">
<title>HBase MapReduce Examples</title> <title>HBase MapReduce Examples</title>
<section xml:id="mapreduce.example.read"> <section
xml:id="mapreduce.example.read">
<title>HBase MapReduce Read Example</title> <title>HBase MapReduce Read Example</title>
<para>The following is an example of using HBase as a MapReduce source in a read-only manner. Specifically, <para>The following is an example of using HBase as a MapReduce source in a read-only manner.
there is a Mapper instance but no Reducer, and nothing is being emitted from the Mapper. The job would be defined Specifically, there is a Mapper instance but no Reducer, and nothing is being emitted from
as follows... the Mapper. The job would be defined as follows...</para>
<programlisting> <programlisting>
Configuration config = HBaseConfiguration.create(); Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead"); Job job = new Job(config, "ExampleRead");
@ -784,7 +937,8 @@ if (!b) {
throw new IOException("error with job!"); throw new IOException("error with job!");
} }
</programlisting> </programlisting>
...and the mapper instance would extend <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>... <para>...and the mapper instance would extend <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>...</para>
<programlisting> <programlisting>
public static class MyMapper extends TableMapper&lt;Text, Text&gt; { public static class MyMapper extends TableMapper&lt;Text, Text&gt; {
@ -793,12 +947,12 @@ public static class MyMapper extends TableMapper&lt;Text, Text&gt; {
} }
} }
</programlisting> </programlisting>
</para>
</section> </section>
<section xml:id="mapreduce.example.readwrite"> <section
xml:id="mapreduce.example.readwrite">
<title>HBase MapReduce Read/Write Example</title> <title>HBase MapReduce Read/Write Example</title>
<para>The following is an example of using HBase both as a source and as a sink with MapReduce. <para>The following is an example of using HBase both as a source and as a sink with
This example will simply copy data from one table to another.</para> MapReduce. This example will simply copy data from one table to another.</para>
<programlisting> <programlisting>
Configuration config = HBaseConfiguration.create(); Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleReadWrite"); Job job = new Job(config,"ExampleReadWrite");
@ -827,14 +981,17 @@ if (!b) {
throw new IOException("error with job!"); throw new IOException("error with job!");
} }
</programlisting> </programlisting>
<para>An explanation is required of what <classname>TableMapReduceUtil</classname> is doing, especially with the reducer. <para>An explanation is required of what <classname>TableMapReduceUtil</classname> is doing,
<link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link> is being used especially with the reducer. <link
as the outputFormat class, and several parameters are being set on the config (e.g., TableOutputFormat.OUTPUT_TABLE), as xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>
well as setting the reducer output key to <classname>ImmutableBytesWritable</classname> and reducer value to <classname>Writable</classname>. is being used as the outputFormat class, and several parameters are being set on the
These could be set by the programmer on the job and conf, but <classname>TableMapReduceUtil</classname> tries to make things easier.</para> config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key
<para>The following is the example mapper, which creates a <classname>Put</classname> matching the input <classname>Result</classname> to <classname>ImmutableBytesWritable</classname> and reducer value to
and emits it. Note: this is what the CopyTable utility does. <classname>Writable</classname>. These could be set by the programmer on the job and
</para> conf, but <classname>TableMapReduceUtil</classname> tries to make things easier.</para>
<para>The following is the example mapper, which creates a <classname>Put</classname>
matching the input <classname>Result</classname> and emits it. Note: this is what the
CopyTable utility does. </para>
<programlisting> <programlisting>
public static class MyMapper extends TableMapper&lt;ImmutableBytesWritable, Put&gt; { public static class MyMapper extends TableMapper&lt;ImmutableBytesWritable, Put&gt; {
@ -852,22 +1009,23 @@ public static class MyMapper extends TableMapper&lt;ImmutableBytesWritable, Put&
} }
} }
</programlisting> </programlisting>
<para>There isn't actually a reducer step, so <classname>TableOutputFormat</classname> takes care of sending the <classname>Put</classname> <para>There isn't actually a reducer step, so <classname>TableOutputFormat</classname> takes
to the target table. care of sending the <classname>Put</classname> to the target table. </para>
</para> <para>This is just an example; developers could choose not to use
<para>This is just an example; developers could choose not to use <classname>TableOutputFormat</classname> and connect to the <classname>TableOutputFormat</classname> and connect to the target table themselves.
target table themselves.
</para> </para>
</section> </section>
<section xml:id="mapreduce.example.readwrite.multi"> <section
xml:id="mapreduce.example.readwrite.multi">
<title>HBase MapReduce Read/Write Example With Multi-Table Output</title> <title>HBase MapReduce Read/Write Example With Multi-Table Output</title>
<para>TODO: example for <classname>MultiTableOutputFormat</classname>. <para>TODO: example for <classname>MultiTableOutputFormat</classname>. </para>
</para>
</section> </section>
<section xml:id="mapreduce.example.summary"> <section
xml:id="mapreduce.example.summary">
<title>HBase MapReduce Summary to HBase Example</title> <title>HBase MapReduce Summary to HBase Example</title>
<para>The following example uses HBase as a MapReduce source and sink with a summarization step. This example will <para>The following example uses HBase as a MapReduce source and sink with a summarization
count the number of distinct instances of a value in a table and write those summarized counts in another table. step. This example will count the number of distinct instances of a value in a table and
write those summarized counts in another table.
<programlisting> <programlisting>
Configuration config = HBaseConfiguration.create(); Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleSummary"); Job job = new Job(config,"ExampleSummary");
@ -896,8 +1054,9 @@ if (!b) {
throw new IOException("error with job!"); throw new IOException("error with job!");
} }
</programlisting> </programlisting>
In this example mapper a column with a String-value is chosen as the value to summarize upon. In this example mapper a column with a String-value is chosen as the value to summarize
This value is used as the key to emit from the mapper, and an <classname>IntWritable</classname> represents an instance counter. upon. This value is used as the key to emit from the mapper, and an
<classname>IntWritable</classname> represents an instance counter.
<programlisting> <programlisting>
public static class MyMapper extends TableMapper&lt;Text, IntWritable&gt; { public static class MyMapper extends TableMapper&lt;Text, IntWritable&gt; {
public static final byte[] CF = "cf".getBytes(); public static final byte[] CF = "cf".getBytes();
@ -914,7 +1073,8 @@ public static class MyMapper extends TableMapper&lt;Text, IntWritable&gt; {
} }
} }
</programlisting> </programlisting>
In the reducer, the "ones" are counted (just like any other MR example that does this), and then the reducer emits a <classname>Put</classname>. In the reducer, the "ones" are counted (just like any other MR example that does this),
and then the reducer emits a <classname>Put</classname>.
<programlisting> <programlisting>
public static class MyTableReducer extends TableReducer&lt;Text, IntWritable, ImmutableBytesWritable&gt; { public static class MyTableReducer extends TableReducer&lt;Text, IntWritable, ImmutableBytesWritable&gt; {
public static final byte[] CF = "cf".getBytes(); public static final byte[] CF = "cf".getBytes();
@ -934,11 +1094,12 @@ public static class MyTableReducer extends TableReducer&lt;Text, IntWritable, Im
</programlisting> </programlisting>
</para> </para>
</section> </section>
<section xml:id="mapreduce.example.summary.file"> <section
xml:id="mapreduce.example.summary.file">
<title>HBase MapReduce Summary to File Example</title> <title>HBase MapReduce Summary to File Example</title>
<para>This is very similar to the summary example above, with the exception that this is using HBase as a MapReduce source <para>This is very similar to the summary example above, with the exception that this is using
but HDFS as the sink. The differences are in the job setup and in the reducer. The mapper remains the same. HBase as a MapReduce source but HDFS as the sink. The differences are in the job setup and
</para> in the reducer. The mapper remains the same. </para>
<programlisting> <programlisting>
Configuration config = HBaseConfiguration.create(); Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleSummaryToFile"); Job job = new Job(config,"ExampleSummaryToFile");
@ -965,8 +1126,9 @@ if (!b) {
throw new IOException("error with job!"); throw new IOException("error with job!");
} }
</programlisting> </programlisting>
<para>As stated above, the previous Mapper can run unchanged with this example. <para>As stated above, the previous Mapper can run unchanged with this example. As for the
As for the Reducer, it is a "generic" Reducer instead of extending TableReducer and emitting Puts.</para> Reducer, it is a "generic" Reducer instead of extending TableReducer and emitting
Puts.</para>
<programlisting> <programlisting>
public static class MyReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; { public static class MyReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; {
@ -980,31 +1142,33 @@ if (!b) {
} }
</programlisting> </programlisting>
</section> </section>
<section xml:id="mapreduce.example.summary.noreducer"> <section
xml:id="mapreduce.example.summary.noreducer">
<title>HBase MapReduce Summary to HBase Without Reducer</title> <title>HBase MapReduce Summary to HBase Without Reducer</title>
<para>It is also possible to perform summaries without a reducer - if you use HBase as the reducer. <para>It is also possible to perform summaries without a reducer - if you use HBase as the
</para> reducer. </para>
<para>An HBase target table would need to exist for the job summary. The HTable method <code>incrementColumnValue</code> <para>An HBase target table would need to exist for the job summary. The HTable method
would be used to atomically increment values. From a performance perspective, it might make sense to keep a Map <code>incrementColumnValue</code> would be used to atomically increment values. From a
of values with their counts to be incremented for each map-task, and make one update per key during the <code> performance perspective, it might make sense to keep a Map of values with their counts to
cleanup</code> method of the mapper. However, your mileage may vary depending on the number of rows to be processed and be incremented for each map-task, and make one update per key during the <code>
unique keys. cleanup</code> method of the mapper. However, your mileage may vary depending on the
</para> number of rows to be processed and unique keys. </para>
<para>In the end, the summary results are in HBase. <para>In the end, the summary results are in HBase. </para>
</para>
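<para>The per-task Map idea described above can be sketched in plain Java. This is a minimal sketch under stated assumptions: <code>CounterSink</code> is a hypothetical interface standing in for an atomic increment such as <code>incrementColumnValue</code>, and the <code>map</code> and <code>cleanup</code> methods only mirror the Mapper lifecycle by name. No HBase or Hadoop classes are used.</para>

```java
import java.util.HashMap;
import java.util.Map;

/** Buffers counts in memory per task and flushes one update per distinct key. */
final class BufferedCounterSketch {

    /** Hypothetical stand-in for an atomic increment against the target table. */
    interface CounterSink {
        void increment(String key, long amount);
    }

    private final Map<String, Long> pending = new HashMap<>();
    private final CounterSink sink;

    BufferedCounterSketch(CounterSink sink) {
        this.sink = sink;
    }

    /** Called once per input row, like map(): touches only local memory. */
    void map(String value) {
        pending.merge(value, 1L, Long::sum);
    }

    /** Called once at the end of the task, like cleanup(): one update per distinct key. */
    void cleanup() {
        for (Map.Entry<String, Long> entry : pending.entrySet()) {
            sink.increment(entry.getKey(), entry.getValue());
        }
        pending.clear();
    }

    public static void main(String[] args) {
        Map<String, Long> table = new HashMap<>(); // plays the role of the target table
        int[] updates = {0};                       // counts calls made to the sink
        BufferedCounterSketch task = new BufferedCounterSketch((key, amount) -> {
            updates[0]++;
            table.merge(key, amount, Long::sum);
        });
        for (String value : new String[] {"a", "b", "a", "a"}) {
            task.map(value);
        }
        task.cleanup();
        System.out.println(table + " written with " + updates[0] + " updates");
    }
}
```

<para>The trade-off is memory: the buffer grows with the number of unique keys seen by a task, which is why the benefit depends on the number of rows and unique keys.</para>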
</section> </section>
<section xml:id="mapreduce.example.summary.rdbms"> <section
xml:id="mapreduce.example.summary.rdbms">
<title>HBase MapReduce Summary to RDBMS</title> <title>HBase MapReduce Summary to RDBMS</title>
<para>Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases, it is possible <para>Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases,
to generate summaries directly to an RDBMS via a custom reducer. The <code>setup</code> method it is possible to generate summaries directly to an RDBMS via a custom reducer. The
can connect to an RDBMS (the connection information can be passed via custom parameters in the context) and the <code>setup</code> method can connect to an RDBMS (the connection information can be
cleanup method can close the connection. passed via custom parameters in the context) and the cleanup method can close the
</para> connection. </para>
<para>It is critical to understand that the number of reducers for the job affects the summarization implementation, and <para>It is critical to understand that the number of reducers for the job affects the
you'll have to design this into your reducer. Specifically, whether it is designed to run as a singleton (one reducer) summarization implementation, and you'll have to design this into your reducer.
or multiple reducers. Neither is right or wrong; it depends on your use-case. Recognize that the more reducers that Specifically, whether it is designed to run as a singleton (one reducer) or multiple
are assigned to the job, the more simultaneous connections to the RDBMS will be created - this will scale, but only to a point. reducers. Neither is right or wrong; it depends on your use-case. Recognize that the more
</para> reducers that are assigned to the job, the more simultaneous connections to the RDBMS will
be created - this will scale, but only to a point. </para>
<programlisting> <programlisting>
public static class MyRdbmsReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; { public static class MyRdbmsReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; {
@ -1025,17 +1189,17 @@ if (!b) {
} }
</programlisting> </programlisting>
<para>In the end, the summary results are written to your RDBMS table(s). <para>In the end, the summary results are written to your RDBMS table(s). </para>
</para>
</section> </section>
</section> <!-- mr examples --> </section>
<section xml:id="mapreduce.htable.access"> <!-- mr examples -->
<section
xml:id="mapreduce.htable.access">
<title>Accessing Other HBase Tables in a MapReduce Job</title> <title>Accessing Other HBase Tables in a MapReduce Job</title>
<para>Although the framework currently allows one HBase table as input to a <para>Although the framework currently allows one HBase table as input to a MapReduce job,
MapReduce job, other HBase tables can other HBase tables can be accessed as lookup tables, etc., in a MapReduce job by creating
be accessed as lookup tables, etc., in a an HTable instance in the setup method of the Mapper.
MapReduce job by creating an HTable instance in the setup method of the Mapper.
<programlisting>public class MyMapper extends TableMapper&lt;Text, LongWritable&gt; { <programlisting>public class MyMapper extends TableMapper&lt;Text, LongWritable&gt; {
private HTable myOtherTable; private HTable myOtherTable;
@ -1051,17 +1215,16 @@ if (!b) {
</programlisting> </programlisting>
</para> </para>
</section> </section>
<section xml:id="mapreduce.specex"> <section
xml:id="mapreduce.specex">
<title>Speculative Execution</title> <title>Speculative Execution</title>
<para>It is generally advisable to turn off speculative execution for <para>It is generally advisable to turn off speculative execution for MapReduce jobs that use
MapReduce jobs that use HBase as a source. This can either be done on a HBase as a source. This can either be done on a per-Job basis through properties, or on the
per-Job basis through properties, or on the entire cluster. Especially entire cluster. Especially for longer running jobs, speculative execution will create
for longer running jobs, speculative execution will create duplicate duplicate map-tasks which will double-write your data to HBase; this is probably not what
map-tasks which will double-write your data to HBase; this is probably you want. </para>
not what you want. <para>See <xref
</para> linkend="spec.ex" /> for more information. </para>
<para>See <xref linkend="spec.ex"/> for more information.
</para>
</section> </section>
</chapter> <!-- mapreduce --> </chapter> <!-- mapreduce -->