HBASE-8807 HBase MapReduce Job-Launch Documentation Misplaced (Misty Stanley-Jones)

This commit is contained in:
Michael Stack 2014-05-27 15:13:23 -07:00
parent d9739b9e3f
commit ef995efb1a
3 changed files with 336 additions and 407 deletions

View File

@ -20,104 +20,7 @@
Provides HBase <a href="http://wiki.apache.org/hadoop/HadoopMapReduce">MapReduce</a>
Input/OutputFormats, a table indexing MapReduce job, and utility
<h2>Table of Contents</h2>
<ul>
<li><a href="#classpath">HBase, MapReduce and the CLASSPATH</a></li>
<li><a href="#sink">HBase as MapReduce job data source and sink</a></li>
<li><a href="#examples">Example Code</a></li>
</ul>
<h2><a name="classpath">HBase, MapReduce and the CLASSPATH</a></h2>
<p>MapReduce jobs deployed to a MapReduce cluster do not by default have access
to the HBase configuration under <code>$HBASE_CONF_DIR</code> nor to HBase classes.
You could add <code>hbase-site.xml</code> to <code>$HADOOP_HOME/conf</code>, add
<code>hbase-X.X.X.jar</code> to <code>$HADOOP_HOME/lib</code>, and copy these
changes across your cluster, but the cleanest means of adding HBase configuration
and classes to the cluster <code>CLASSPATH</code> is to uncomment
<code>HADOOP_CLASSPATH</code> in <code>$HADOOP_HOME/conf/hadoop-env.sh</code>
and add the HBase dependencies there. For example, here is how you would amend
<code>hadoop-env.sh</code> to add the
built HBase jar, ZooKeeper (needed by the HBase client), the HBase conf, and the
<code>PerformanceEvaluation</code> class from the built HBase test jar to the
Hadoop <code>CLASSPATH</code>:
<blockquote><pre># Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH=
export HADOOP_CLASSPATH=$HBASE_HOME/build/hbase-X.X.X.jar:$HBASE_HOME/build/hbase-X.X.X-test.jar:$HBASE_HOME/conf:${HBASE_HOME}/lib/zookeeper-X.X.X.jar</pre></blockquote>
<p>Expand <code>$HBASE_HOME</code> in the above appropriately to suit your
local environment.</p>
<p>After copying the above change around your cluster (and restarting), this is
how you would run the PerformanceEvaluation MR job to put up four clients (presumes
a running MapReduce cluster):
<blockquote><pre>$HADOOP_HOME/bin/hadoop org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 4</pre></blockquote>
The PerformanceEvaluation class will be found on the <code>CLASSPATH</code> because you
added the built HBase test jar to <code>HADOOP_CLASSPATH</code>.
</p>
<p>Another possibility, if for example you do not have access to <code>hadoop-env.sh</code> or
are unable to restart the Hadoop cluster, is bundling the HBase jar into a MapReduce
job jar, adding it and its dependencies under the job jar <code>lib/</code>
directory and the HBase conf into a job jar <code>conf/</code> directory.
</p>
<h2><a name="sink">HBase as MapReduce job data source and sink</a></h2>
<p>HBase can be used as a data source, {@link org.apache.hadoop.hbase.mapred.TableInputFormat TableInputFormat},
and data sink, {@link org.apache.hadoop.hbase.mapred.TableOutputFormat TableOutputFormat}, for MapReduce jobs.
Writing MapReduce jobs that read or write HBase, you'll probably want to subclass
{@link org.apache.hadoop.hbase.mapred.TableMap TableMap} and/or
{@link org.apache.hadoop.hbase.mapred.TableReduce TableReduce}. See the do-nothing
pass-through classes {@link org.apache.hadoop.hbase.mapred.IdentityTableMap IdentityTableMap} and
{@link org.apache.hadoop.hbase.mapred.IdentityTableReduce IdentityTableReduce} for basic usage. For a more
involved example, see <code>BuildTableIndex</code>
or review the <code>org.apache.hadoop.hbase.mapred.TestTableMapReduce</code> unit test.
</p>
<p>Running mapreduce jobs that have hbase as source or sink, you'll need to
specify source/sink table and column names in your configuration.</p>
<p>Reading from hbase, the TableInputFormat asks hbase for the list of
regions and makes a map per region, or <code>mapred.map.tasks</code> maps,
whichever is smaller (if your job only has two maps, raise <code>mapred.map.tasks</code>
to a number &gt; the number of regions). Maps will run on the adjacent TaskTracker
if you are running a TaskTracker and RegionServer per node.
Writing, it may make sense to avoid the reduce step and write yourself back into
hbase from inside your map. You'd do this when your job does not need the sort
and collation that mapreduce does on the map emitted data; on insert,
hbase 'sorts' so there is no point double-sorting (and shuffling data around
your mapreduce cluster) unless you need to. If you do not need the reduce,
you might just have your map emit counts of records processed just so the
framework's report at the end of your job has meaning or set the number of
reduces to zero and use TableOutputFormat. See example code
below. If running the reduce step makes sense in your case, it's usually better
to have lots of reducers so load is spread across the HBase cluster.</p>
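<p>For illustration, the set-up for such a map-only write might be sketched as follows using
the old-style classes in this package. <code>MyMap</code> is a stand-in for your own
<code>TableMap</code> implementation that emits <code>ImmutableBytesWritable</code> keys and
<code>Put</code> values; the table names are examples only.</p>
<blockquote><pre>// Sketch only: a map-only job that writes Puts straight back to HBase.
JobConf job = new JobConf(HBaseConfiguration.create());
job.setJobName("map-only-write");
// Scan the "cf" family of source_table with the (hypothetical) MyMap mapper.
TableMapReduceUtil.initTableMapJob("source_table", "cf:",
    MyMap.class, ImmutableBytesWritable.class, Put.class, job);
// No reduce step: let TableOutputFormat write the emitted Puts directly.
job.setNumReduceTasks(0);
job.setOutputFormat(TableOutputFormat.class);
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Put.class);
job.set(TableOutputFormat.OUTPUT_TABLE, "target_table");
JobClient.runJob(job);</pre></blockquote>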
<p>There is also a new hbase partitioner that will run as many reducers as
currently existing regions. The
{@link org.apache.hadoop.hbase.mapred.HRegionPartitioner} is suitable
when your table is large and your upload is not such that it will greatly
alter the number of existing regions when done; otherwise, use the default
partitioner.
</p>
<h2><a name="examples">Example Code</a></h2>
<h3>Sample Row Counter</h3>
<p>See {@link org.apache.hadoop.hbase.mapred.RowCounter}. You should be able to run
it by doing: <code>% ./bin/hadoop jar hbase-X.X.X.jar</code>. This will invoke
the hbase MapReduce Driver class. Select 'rowcounter' from the choice of jobs
offered. You may need to add the hbase conf directory to <code>$HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH</code>
so the rowcounter gets pointed at the right hbase cluster (or, build a new jar
with an appropriate hbase-site.xml built into your job jar).
</p>
<h3>PerformanceEvaluation</h3>
<p>See org.apache.hadoop.hbase.PerformanceEvaluation from hbase src/test. It runs
a mapreduce job to run concurrent clients reading and writing hbase.
</p>
<p>See <a href="http://hbase.apache.org/book.html#mapreduce">HBase and MapReduce</a>
in the HBase Reference Guide for more documentation on MapReduce over HBase.
*/
package org.apache.hadoop.hbase.mapred;

View File

@ -20,144 +20,7 @@
Provides HBase <a href="http://wiki.apache.org/hadoop/HadoopMapReduce">MapReduce</a>
Input/OutputFormats, a table indexing MapReduce job, and utility
<h2>Table of Contents</h2>
<ul>
<li><a href="#classpath">HBase, MapReduce and the CLASSPATH</a></li>
<li><a href="#driver">Bundled HBase MapReduce Jobs</a></li>
<li><a href="#sink">HBase as MapReduce job data source and sink</a></li>
<li><a href="#bulk">Bulk Import writing HFiles directly</a></li>
<li><a href="#examples">Example Code</a></li>
</ul>
<h2><a name="classpath">HBase, MapReduce and the CLASSPATH</a></h2>
<p>MapReduce jobs deployed to a MapReduce cluster do not by default have access
to the HBase configuration under <code>$HBASE_CONF_DIR</code> nor to HBase classes.
You could add <code>hbase-site.xml</code> to
<code>$HADOOP_HOME/conf</code> and add
HBase jars to the <code>$HADOOP_HOME/lib</code> and copy these
changes across your cluster (or edit conf/hadoop-env.sh and add them to the
<code>HADOOP_CLASSPATH</code> variable) but this will pollute your
Hadoop install with HBase references; it's also obnoxious, requiring a restart of
the Hadoop cluster before it will notice your HBase additions.</p>
<p>As of 0.90.x, HBase will just add its dependency jars to the job
configuration; the dependencies just need to be available on the local
<code>CLASSPATH</code>. For example, to run the bundled HBase
{@link org.apache.hadoop.hbase.mapreduce.RowCounter} mapreduce job against a table named <code>usertable</code>,
type:
<blockquote><pre>
$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0.jar rowcounter usertable
</pre></blockquote>
Expand <code>$HBASE_HOME</code> and <code>$HADOOP_HOME</code> in the above
appropriately to suit your local environment. The content of <code>HADOOP_CLASSPATH</code>
is set to the HBase <code>CLASSPATH</code> via backticking the command
<code>${HBASE_HOME}/bin/hbase classpath</code>.
<p>When the above runs, internally, the HBase jar finds its zookeeper and
<a href="http://code.google.com/p/guava-libraries/">guava</a>,
etc., dependencies on the passed
<code>HADOOP_CLASSPATH</code> and adds the found jars to the mapreduce
job configuration. See the source at
<code>TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job)</code>
for how this is done.
</p>
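<p>If you write your own driver instead of using the bundled jobs, you can ask
<code>TableMapReduceUtil</code> to do the same dependency shipping for you. The following is a
sketch only; <code>MyDriver</code> and <code>MyMapper</code> stand in for your own classes.</p>
<blockquote><pre>// Sketch: let TableMapReduceUtil ship HBase and its zookeeper/guava
// dependencies with the job instead of relying on the cluster install.
Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "my-hbase-job");
job.setJarByClass(MyDriver.class);          // your own driver class
TableMapReduceUtil.initTableMapperJob(
    "source_table",                          // input table
    new Scan(),                              // scan instance controlling what is read
    MyMapper.class,                          // your TableMapper subclass
    Text.class,                              // mapper output key
    Text.class,                              // mapper output value
    job);                                    // by default this also calls addDependencyJars
// If you wire the job up by hand instead, add the jars explicitly:
TableMapReduceUtil.addDependencyJars(job);</pre></blockquote>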
<p>The above may not work if you are running your HBase from its build directory;
i.e. you've done <code>$ mvn test install</code> at
<code>${HBASE_HOME}</code> and you are now
trying to use this build in your mapreduce job. If you get
<blockquote><pre>java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper
...
</pre></blockquote>
exception thrown, try doing the following:
<blockquote><pre>
$ HADOOP_CLASSPATH=${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar rowcounter usertable
</pre></blockquote>
Notice how we preface the backtick invocation setting
<code>HADOOP_CLASSPATH</code> with reference to the built HBase jar over in
the <code>target</code> directory.
</p>
<h2><a name="driver">Bundled HBase MapReduce Jobs</a></h2>
<p>The HBase jar also serves as a Driver for some bundled mapreduce jobs. To
learn about the bundled mapreduce jobs run:
<blockquote><pre>
$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar
An example program must be given as the first argument.
Valid program names are:
copytable: Export a table from local cluster to peer cluster
completebulkload: Complete a bulk data load.
export: Write table data to HDFS.
import: Import data written by Export.
importtsv: Import data in TSV format.
rowcounter: Count rows in HBase table
</pre></blockquote>
<h2><a name="sink">HBase as MapReduce job data source and sink</a></h2>
<p>HBase can be used as a data source, {@link org.apache.hadoop.hbase.mapreduce.TableInputFormat TableInputFormat},
and data sink, {@link org.apache.hadoop.hbase.mapreduce.TableOutputFormat TableOutputFormat}
or {@link org.apache.hadoop.hbase.mapreduce.MultiTableOutputFormat MultiTableOutputFormat},
for MapReduce jobs.
Writing MapReduce jobs that read or write HBase, you'll probably want to subclass
{@link org.apache.hadoop.hbase.mapreduce.TableMapper TableMapper} and/or
{@link org.apache.hadoop.hbase.mapreduce.TableReducer TableReducer}. See the do-nothing
pass-through classes {@link org.apache.hadoop.hbase.mapreduce.IdentityTableMapper IdentityTableMapper} and
{@link org.apache.hadoop.hbase.mapreduce.IdentityTableReducer IdentityTableReducer} for basic usage. For a more
involved example, see {@link org.apache.hadoop.hbase.mapreduce.RowCounter}
or review the <code>org.apache.hadoop.hbase.mapreduce.TestTableMapReduce</code> unit test.
</p>
<p>Running mapreduce jobs that have HBase as source or sink, you'll need to
specify source/sink table and column names in your configuration.</p>
<p>Reading from HBase, the TableInputFormat asks HBase for the list of
regions and makes a map per region, or <code>mapreduce.job.maps</code> maps,
whichever is smaller (if your job only has two maps, raise <code>mapreduce.job.maps</code>
to a number &gt; the number of regions). Maps will run on the adjacent TaskTracker
if you are running a TaskTracker and RegionServer per node.
Writing, it may make sense to avoid the reduce step and write yourself back into
HBase from inside your map. You'd do this when your job does not need the sort
and collation that mapreduce does on the map emitted data; on insert,
HBase 'sorts' so there is no point double-sorting (and shuffling data around
your mapreduce cluster) unless you need to. If you do not need the reduce,
you might just have your map emit counts of records processed just so the
framework's report at the end of your job has meaning or set the number of
reduces to zero and use TableOutputFormat. See example code
below. If running the reduce step makes sense in your case, it's usually better
to have lots of reducers so load is spread across the HBase cluster.</p>
<p>There is also a new HBase partitioner that will run as many reducers as
currently existing regions. The
{@link org.apache.hadoop.hbase.mapreduce.HRegionPartitioner} is suitable
when your table is large and your upload is not such that it will greatly
alter the number of existing regions when done; otherwise use the default
partitioner.
</p>
<h2><a name="bulk">Bulk import writing HFiles directly</a></h2>
<p>If importing into a new table, it's possible to bypass the HBase API
and write your content directly to the filesystem, properly formatted as
HBase data files (HFiles). Your import will run faster, perhaps an order of
magnitude faster if not more. For more on how this mechanism works, see the
<a href="http://hbase.apache.org/bulk-loads.html">Bulk Loads</a>
documentation.
</p>
<h2><a name="examples">Example Code</a></h2>
<h3>Sample Row Counter</h3>
<p>See {@link org.apache.hadoop.hbase.mapreduce.RowCounter}. This job uses
{@link org.apache.hadoop.hbase.mapreduce.TableInputFormat TableInputFormat} and
does a count of all rows in the specified table.
You should be able to run
it by doing: <code>% ./bin/hadoop jar hbase-X.X.X.jar</code>. This will invoke
the hbase MapReduce Driver class. Select 'rowcounter' from the choice of jobs
offered. This will emit rowcounter 'usage'. Specify the table name, column to count,
and output directory. You may need to add the hbase conf directory to <code>$HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH</code>
so the rowcounter gets pointed at the right hbase cluster (or, build a new jar
with an appropriate hbase-site.xml built into your job jar).
</p>
<p>See <a href="http://hbase.apache.org/book.html#mapreduce">HBase and MapReduce</a>
in the HBase Reference Guide for more documentation on MapReduce over HBase.
*/
package org.apache.hadoop.hbase.mapreduce;

View File

@ -664,18 +664,76 @@ htable.put(put);
<!-- schema design -->
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="schema_design.xml"/>
<chapter xml:id="mapreduce">
<title>HBase and MapReduce</title>
<para>See <link xlink:href="http://hbase.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description">
HBase and MapReduce</link> up in javadocs.
Start there. Below is some additional help.</para>
<para>For more information about MapReduce (i.e., the framework in general), see the Hadoop site (TODO: Need good links here --
we used to have some but they rotted against apache hadoop).</para>
<caution>
<title>Notice to Mapreduce users of HBase 0.96.1 and above</title>
<para>Some mapreduce jobs that use HBase fail to launch. The symptom is an
exception similar to the following:
<programlisting>
<chapter
xml:id="mapreduce">
<title>HBase and MapReduce</title>
<para>Apache MapReduce is a software framework used to analyze large amounts of data, and is
the framework used most often with <link
xlink:href="http://hadoop.apache.org/">Apache Hadoop</link>. MapReduce itself is out of the
scope of this document. A good place to get started with MapReduce is <link
xlink:href="http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html" />. MapReduce version
2 (MR2) is now part of <link
xlink:href="http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/">YARN</link>. </para>
<para> This chapter discusses specific configuration steps you need to take to use MapReduce on
data within HBase. In addition, it discusses other interactions and issues between HBase and
MapReduce jobs.
<note>
<title>mapred and mapreduce</title>
<para>There are two mapreduce packages in HBase as in MapReduce itself: <filename>org.apache.hadoop.hbase.mapred</filename>
and <filename>org.apache.hadoop.hbase.mapreduce</filename>. The former uses the old-style API and the latter
the new style. The latter has more facilities, though you can usually find an equivalent in the older
package. Pick the package that goes with your MapReduce deploy. When in doubt or starting over, pick
<filename>org.apache.hadoop.hbase.mapreduce</filename>. In the notes below, we refer to
o.a.h.h.mapreduce, but replace it with o.a.h.h.mapred if that is what you are using.
</para>
</note>
</para>
<section
xml:id="hbase.mapreduce.classpath">
<title>HBase, MapReduce, and the CLASSPATH</title>
<para>By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either
the HBase configuration under <envar>$HBASE_CONF_DIR</envar> or the HBase classes.</para>
<para>To give the MapReduce jobs the access they need, you could add
<filename>hbase-site.xml</filename> to the
<filename><replaceable>$HADOOP_HOME</replaceable>/conf/</filename> directory and add the
HBase JARs to the <filename><replaceable>$HADOOP_HOME</replaceable>/lib/</filename>
directory, then copy these changes across your cluster, or you could edit
<filename><replaceable>$HADOOP_HOME</replaceable>/conf/hadoop-env.sh</filename> and add
the HBase JARs to the <envar>HADOOP_CLASSPATH</envar> variable. However, neither approach is
recommended because it pollutes your Hadoop install with HBase references. It also
requires you to restart the Hadoop cluster before the changes take effect.</para>
<para> Since HBase 0.90.x, HBase adds its dependency JARs to the job configuration itself. The
dependencies only need to be available on the local CLASSPATH. The following example runs
the bundled HBase <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
MapReduce job against a table named <systemitem>usertable</systemitem>. If you have not set
the environment variables expected in the command (the parts prefixed by a
<literal>$</literal> sign and curly braces), you can use the actual system paths instead.
Be sure to use the correct version of the HBase JAR for your system. The backticks
(<literal>`</literal> symbols) cause the shell to execute the sub-commands, setting the
CLASSPATH as part of the command. This example assumes you use a BASH-compatible shell. </para>
<screen>$ <userinput>HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0.jar rowcounter usertable</userinput></screen>
<para>When the command runs, internally, the HBase JAR finds its ZooKeeper, Guava,
and other dependencies on the passed <envar>HADOOP_CLASSPATH</envar>
and adds the JARs to the MapReduce job configuration. See the source at
TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done. </para>
<note>
<para> The example may not work if you are running HBase from its build directory rather
than an installed location. You may see an error like the following:</para>
<screen>java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper</screen>
<para>If this occurs, try modifying the command as follows, so that it uses the HBase JARs
from the <filename>target/</filename> directory within the build environment.</para>
<screen>$ <userinput>HADOOP_CLASSPATH=${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar rowcounter usertable</userinput></screen>
</note>
<caution>
<title>Notice to Mapreduce users of HBase 0.96.1 and above</title>
<para>Some mapreduce jobs that use HBase fail to launch. The symptom is an exception similar
to the following:</para>
<screen>
Exception in thread "main" java.lang.IllegalAccessError: class
com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass
com.google.protobuf.LiteralByteString
@ -703,63 +761,158 @@ Exception in thread "main" java.lang.IllegalAccessError: class
at
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100)
...
</programlisting>
This is because of an optimization introduced in <link
xlink:href="https://issues.apache.org/jira/browse/HBASE-9867">HBASE-9867</link>
that inadvertently introduced a classloader dependency.
</para>
<para>This affects both jobs using the <code>-libjars</code> option and
"fat jar" jobs, those which package their runtime dependencies in a nested
<code>lib</code> folder.</para>
<para>In order to satisfy the new classloader requirements,
hbase-protocol.jar must be included in Hadoop's classpath. This can be
resolved system-wide by including a reference to the hbase-protocol.jar in
hadoop's lib directory, via a symlink or by copying the jar into the new
location.</para>
<para>This can also be achieved on a per-job launch basis by including it
in the <code>HADOOP_CLASSPATH</code> environment variable at job submission
time. When launching jobs that package their dependencies, all three of the
following job launching commands satisfy this requirement:</para>
<programlisting>
$ HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
$ HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
$ HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass
</programlisting>
<para>For jars that do not package their dependencies, the following command
structure is necessary:</para>
<programlisting>
$ HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',') ...
</programlisting>
<para>See also <link
xlink:href="https://issues.apache.org/jira/browse/HBASE-10304">HBASE-10304</link>
for further discussion of this issue.</para>
</caution>
<section xml:id="splitter">
<title>Map-Task Splitting</title>
<section xml:id="splitter.default">
<title>The Default HBase MapReduce Splitter</title>
<para>When <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>
is used to source an HBase table in a MapReduce job,
its splitter will make a map task for each region of the table.
Thus, if there are 100 regions in the table, there will be
100 map-tasks for the job - regardless of how many column families are selected in the Scan.</para>
</screen>
<para>This is caused by an optimization introduced in <link
xlink:href="https://issues.apache.org/jira/browse/HBASE-9867">HBASE-9867</link> that
inadvertently introduced a classloader dependency. </para>
<para>This affects both jobs using the <code>-libjars</code> option and "fat jar" jobs, those
which package their runtime dependencies in a nested <code>lib</code> folder.</para>
<para>In order to satisfy the new classloader requirements, hbase-protocol.jar must be
included in Hadoop's classpath. See <xref
linkend="hbase.mapreduce.classpath" /> for current recommendations for resolving
classpath errors. The following is included for historical purposes.</para>
<para>This can be resolved system-wide by including a reference to the hbase-protocol.jar in
hadoop's lib directory, via a symlink or by copying the jar into the new location.</para>
<para>This can also be achieved on a per-job launch basis by including it in the
<code>HADOOP_CLASSPATH</code> environment variable at job submission time. When
launching jobs that package their dependencies, all three of the following job launching
commands satisfy this requirement:</para>
<screen>
$ <userinput>HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass</userinput>
$ <userinput>HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass</userinput>
$ <userinput>HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass</userinput>
</screen>
<para>For jars that do not package their dependencies, the following command structure is
necessary:</para>
<screen>
$ <userinput>HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',')</userinput> ...
</screen>
<para>See also <link
xlink:href="https://issues.apache.org/jira/browse/HBASE-10304">HBASE-10304</link> for
further discussion of this issue.</para>
</caution>
</section>
<section xml:id="splitter.custom">
<title>Custom Splitters</title>
<para>For those interested in implementing custom splitters, see the method <code>getSplits</code> in
<link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.html">TableInputFormatBase</link>.
That is where the logic for map-task assignment resides.
</para>
<section>
<title>Bundled HBase MapReduce Jobs</title>
<para>The HBase JAR also serves as a Driver for some bundled mapreduce jobs. To learn about
the bundled MapReduce jobs, run the following command.</para>
<screen>$ <userinput>${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar</userinput>
<computeroutput>An example program must be given as the first argument.
Valid program names are:
copytable: Export a table from local cluster to peer cluster
completebulkload: Complete a bulk data load.
export: Write table data to HDFS.
import: Import data written by Export.
importtsv: Import data in TSV format.
rowcounter: Count rows in HBase table</computeroutput>
</screen>
<para>Each of the valid program names is a bundled MapReduce job. To run one of the jobs,
model your command after the following example.</para>
<screen>$ <userinput>${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar rowcounter myTable</userinput></screen>
</section>
</section>
<section xml:id="mapreduce.example">
<title>HBase MapReduce Examples</title>
<section xml:id="mapreduce.example.read">
<title>HBase MapReduce Read Example</title>
<para>The following is an example of using HBase as a MapReduce source in a read-only manner. Specifically,
there is a Mapper instance but no Reducer, and nothing is being emitted from the Mapper. The job would be defined
as follows...
<programlisting>
<section>
<title>HBase as a MapReduce Job Data Source and Data Sink</title>
<para>HBase can be used as a data source, <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>,
and data sink, <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>
or <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.html">MultiTableOutputFormat</link>,
for MapReduce jobs. Writing MapReduce jobs that read or write HBase, it is advisable to
subclass <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>
and/or <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableReducer.html">TableReducer</link>.
See the do-nothing pass-through classes <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableMapper.html">IdentityTableMapper</link>
and <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableReducer.html">IdentityTableReducer</link>
for basic usage. For a more involved example, see <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
or review the <code>org.apache.hadoop.hbase.mapreduce.TestTableMapReduce</code> unit test. </para>
<para>If you run MapReduce jobs that use HBase as a source or sink, you need to specify the
source or sink table and column names in your configuration.</para>
<para>When you read from HBase, the <code>TableInputFormat</code> requests the list of regions
from HBase and makes a map task for each region, or as many map tasks as
<code>mapreduce.job.maps</code>, whichever is smaller. If your job only has two maps,
raise <code>mapreduce.job.maps</code> to a number greater than the number of regions. Maps
will run on the adjacent TaskTracker if you are running a TaskTracker and RegionServer per
node. When writing to HBase, it may make sense to avoid the Reduce step and write back into
HBase from within your map. This approach works when your job does not need the sort and
collation that MapReduce does on the map-emitted data. On insert, HBase 'sorts' so there is
no point double-sorting (and shuffling data around your MapReduce cluster) unless you need
to. If you do not need the Reduce, your map might emit counts of records processed for
reporting at the end of the job, or set the number of Reduces to zero and use
TableOutputFormat. If running the Reduce step makes sense in your case, you should typically
use multiple reducers so that load is spread across the HBase cluster.</para>
<para>A new HBase partitioner, the <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/HRegionPartitioner.html">HRegionPartitioner</link>,
can run as many reducers as there are existing regions. The HRegionPartitioner is suitable
when your table is large and your upload will not greatly alter the number of existing
regions upon completion. Otherwise use the default partitioner. </para>
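          <para>The following sketch shows one way to wire the
            <classname>HRegionPartitioner</classname> into a job at setup time.
            <classname>MyJob</classname>, <classname>MyMapper</classname>, and
            <classname>MyTableReducer</classname> stand in for your own classes, and
            <code>sourceTable</code> and <code>targetTable</code> hold the table names.</para>
          <programlisting>
// Sketch: partition reducer output by the target table's regions.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExamplePartitionedWrite");
job.setJarByClass(MyJob.class);
TableMapReduceUtil.initTableMapperJob(sourceTable, new Scan(), MyMapper.class,
    ImmutableBytesWritable.class, Put.class, job);
// Route each emitted Put to the region that will store it; the number of
// reduce tasks should not exceed the number of regions in the target table.
TableMapReduceUtil.initTableReducerJob(targetTable, MyTableReducer.class, job,
    HRegionPartitioner.class);
          </programlisting>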
</section>
<section>
<title>Writing HFiles Directly During Bulk Import</title>
<para>If you are importing into a new table, you can bypass the HBase API and write your
content directly to the filesystem, formatted into HBase data files (HFiles). Your import
will run faster, perhaps an order of magnitude faster. For more on how this mechanism works,
see <xref
linkend="arch.bulk.load" />.</para>
</section>
<section>
<title>RowCounter Example</title>
<para>The included <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
MapReduce job uses <code>TableInputFormat</code> and does a count of all rows in the specified
table. To run it, use the following command: </para>
<screen>$ <userinput>./bin/hadoop jar hbase-X.X.X.jar</userinput></screen>
<para>This will
invoke the HBase MapReduce Driver class. Select <literal>rowcounter</literal> from the choice of jobs
offered. This will print rowcounter usage advice to standard output. Specify the table name,
column to count, and output
directory. If you have classpath errors, see <xref linkend="hbase.mapreduce.classpath" />.</para>
</section>
<section
xml:id="splitter">
<title>Map-Task Splitting</title>
<section
xml:id="splitter.default">
<title>The Default HBase MapReduce Splitter</title>
<para>When <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>
is used to source an HBase table in a MapReduce job, its splitter will make a map task for
each region of the table. Thus, if there are 100 regions in the table, there will be 100
map-tasks for the job - regardless of how many column families are selected in the
Scan.</para>
</section>
<section
xml:id="splitter.custom">
<title>Custom Splitters</title>
<para>For those interested in implementing custom splitters, see the method
<code>getSplits</code> in <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.html">TableInputFormatBase</link>.
That is where the logic for map-task assignment resides. </para>
</section>
</section>
<section
xml:id="mapreduce.example">
<title>HBase MapReduce Examples</title>
<section
xml:id="mapreduce.example.read">
<title>HBase MapReduce Read Example</title>
        <para>The following is an example of using HBase as a MapReduce source in a read-only manner.
          Specifically, there is a Mapper instance but no Reducer, and nothing is being emitted from
          the Mapper. The job would be defined as follows.</para>
<programlisting>
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class); // class that contains mapper
@ -784,8 +937,9 @@ if (!b) {
throw new IOException("error with job!");
}
</programlisting>
...and the mapper instance would extend <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>...
<programlisting>
<para>...and the mapper instance would extend <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>...</para>
<programlisting>
public static class MyMapper extends TableMapper&lt;Text, Text&gt; {
public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
@ -793,13 +947,13 @@ public static class MyMapper extends TableMapper&lt;Text, Text&gt; {
}
}
</programlisting>
</para>
</section>
<section xml:id="mapreduce.example.readwrite">
<title>HBase MapReduce Read/Write Example</title>
<para>The following is an example of using HBase both as a source and as a sink with MapReduce.
This example will simply copy data from one table to another.</para>
<programlisting>
</section>
<section
xml:id="mapreduce.example.readwrite">
<title>HBase MapReduce Read/Write Example</title>
<para>The following is an example of using HBase both as a source and as a sink with
MapReduce. This example will simply copy data from one table to another.</para>
<programlisting>
Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleReadWrite");
job.setJarByClass(MyReadWriteJob.class); // class that contains mapper
@ -827,15 +981,18 @@ if (!b) {
throw new IOException("error with job!");
}
</programlisting>
<para>An explanation is required of what <classname>TableMapReduceUtil</classname> is doing, especially with the reducer.
<link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link> is being used
as the outputFormat class, and several parameters are being set on the config (e.g., TableOutputFormat.OUTPUT_TABLE), as
well as setting the reducer output key to <classname>ImmutableBytesWritable</classname> and reducer value to <classname>Writable</classname>.
These could be set by the programmer on the job and conf, but <classname>TableMapReduceUtil</classname> tries to make things easier.</para>
<para>The following is the example mapper, which will create a <classname>Put</classname> matching the input <classname>Result</classname>
and emit it. Note: this is what the CopyTable utility does.
</para>
<programlisting>
<para>An explanation is required of what <classname>TableMapReduceUtil</classname> is doing,
especially with the reducer. <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>
is being used as the outputFormat class, and several parameters are being set on the
config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key
to <classname>ImmutableBytesWritable</classname> and reducer value to
<classname>Writable</classname>. These could be set by the programmer on the job and
conf, but <classname>TableMapReduceUtil</classname> tries to make things easier.</para>
        <para>The following is the example mapper, which will create a <classname>Put</classname>
          matching the input <classname>Result</classname> and emit it. Note: this is what the
CopyTable utility does. </para>
<programlisting>
public static class MyMapper extends TableMapper&lt;ImmutableBytesWritable, Put&gt; {
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
@ -852,23 +1009,24 @@ public static class MyMapper extends TableMapper&lt;ImmutableBytesWritable, Put&
}
}
</programlisting>
<para>There isn't actually a reducer step, so <classname>TableOutputFormat</classname> takes care of sending the <classname>Put</classname>
to the target table.
</para>
<para>This is just an example, developers could choose not to use <classname>TableOutputFormat</classname> and connect to the
target table themselves.
</para>
</section>
<section xml:id="mapreduce.example.readwrite.multi">
<title>HBase MapReduce Read/Write Example With Multi-Table Output</title>
<para>TODO: example for <classname>MultiTableOutputFormat</classname>.
</para>
</section>
<section xml:id="mapreduce.example.summary">
<title>HBase MapReduce Summary to HBase Example</title>
<para>The following example uses HBase as a MapReduce source and sink with a summarization step. This example will
count the number of distinct instances of a value in a table and write those summarized counts in another table.
<programlisting>
<para>There isn't actually a reducer step, so <classname>TableOutputFormat</classname> takes
care of sending the <classname>Put</classname> to the target table. </para>
        <para>This is just an example; developers could choose not to use
<classname>TableOutputFormat</classname> and connect to the target table themselves.
</para>
</section>
<section
xml:id="mapreduce.example.readwrite.multi">
<title>HBase MapReduce Read/Write Example With Multi-Table Output</title>
<para>TODO: example for <classname>MultiTableOutputFormat</classname>. </para>
</section>
<section
xml:id="mapreduce.example.summary">
<title>HBase MapReduce Summary to HBase Example</title>
<para>The following example uses HBase as a MapReduce source and sink with a summarization
step. This example will count the number of distinct instances of a value in a table and
          write those summarized counts to another table.
<programlisting>
Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleSummary");
job.setJarByClass(MySummaryJob.class); // class that contains mapper and reducer
@ -896,9 +1054,10 @@ if (!b) {
throw new IOException("error with job!");
}
</programlisting>
In this example mapper a column with a String-value is chosen as the value to summarize upon.
This value is used as the key to emit from the mapper, and an <classname>IntWritable</classname> represents an instance counter.
<programlisting>
In this example mapper a column with a String-value is chosen as the value to summarize
upon. This value is used as the key to emit from the mapper, and an
<classname>IntWritable</classname> represents an instance counter.
<programlisting>
public static class MyMapper extends TableMapper&lt;Text, IntWritable&gt; {
public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR1 = "attr1".getBytes();
@ -914,8 +1073,9 @@ public static class MyMapper extends TableMapper&lt;Text, IntWritable&gt; {
}
}
</programlisting>
In the reducer, the "ones" are counted (just like any other MR example that does this), and then emits a <classname>Put</classname>.
<programlisting>
In the reducer, the "ones" are counted (just like any other MR example that does this),
          and then a <classname>Put</classname> is emitted.
<programlisting>
public static class MyTableReducer extends TableReducer&lt;Text, IntWritable, ImmutableBytesWritable&gt; {
public static final byte[] CF = "cf".getBytes();
public static final byte[] COUNT = "count".getBytes();
@ -932,14 +1092,15 @@ public static class MyTableReducer extends TableReducer&lt;Text, IntWritable, Im
}
}
</programlisting>
</para>
</section>
<section xml:id="mapreduce.example.summary.file">
<title>HBase MapReduce Summary to File Example</title>
<para>This is very similar to the summary example above, with the exception that this example uses HBase as a MapReduce source
but HDFS as the sink. The differences are in the job setup and in the reducer. The mapper remains the same.
</para>
<programlisting>
</para>
</section>
<section
xml:id="mapreduce.example.summary.file">
<title>HBase MapReduce Summary to File Example</title>
        <para>This is very similar to the summary example above, with the exception that this
          example uses HBase as a MapReduce source but HDFS as the sink. The differences are in the
          job setup and in the reducer. The mapper remains the same. </para>
<programlisting>
Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleSummaryToFile");
job.setJarByClass(MySummaryFileJob.class); // class that contains mapper and reducer
@ -965,9 +1126,10 @@ if (!b) {
throw new IOException("error with job!");
}
</programlisting>
<para>As stated above, the previous Mapper can run unchanged with this example.
As for the Reducer, it is a "generic" Reducer instead of extending TableReducer and emitting Puts.</para>
<programlisting>
<para>As stated above, the previous Mapper can run unchanged with this example. As for the
Reducer, it is a "generic" Reducer instead of extending TableMapper and emitting
Puts.</para>
<programlisting>
public static class MyReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; {
public void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context) throws IOException, InterruptedException {
@ -979,33 +1141,35 @@ if (!b) {
}
}
</programlisting>
</section>
<section xml:id="mapreduce.example.summary.noreducer">
<title>HBase MapReduce Summary to HBase Without Reducer</title>
<para>It is also possible to perform summaries without a reducer - if you use HBase as the reducer.
</para>
<para>An HBase target table would need to exist for the job summary. The HTable method <code>incrementColumnValue</code>
would be used to atomically increment values. From a performance perspective, it might make sense to keep a Map
of values with their counts to be incremented for each map-task, and make one update per key during the <code>
cleanup</code> method of the mapper. However, your mileage may vary depending on the number of rows to be processed and
unique keys.
</para>
<para>In the end, the summary results are in HBase.
</para>
</section>
<section xml:id="mapreduce.example.summary.rdbms">
<title>HBase MapReduce Summary to RDBMS</title>
<para>Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases, it is possible
to generate summaries directly to an RDBMS via a custom reducer. The <code>setup</code> method
can connect to an RDBMS (the connection information can be passed via custom parameters in the context) and the
cleanup method can close the connection.
</para>
<para>It is critical to understand that the number of reducers for the job affects the summarization implementation, and
you'll have to design this into your reducer. Specifically, whether it is designed to run as a singleton (one reducer)
or multiple reducers. Neither is right or wrong, it depends on your use-case. Recognize that the more reducers that
are assigned to the job, the more simultaneous connections to the RDBMS will be created - this will scale, but only to a point.
</para>
<programlisting>
</section>
<section
xml:id="mapreduce.example.summary.noreducer">
<title>HBase MapReduce Summary to HBase Without Reducer</title>
<para>It is also possible to perform summaries without a reducer - if you use HBase as the
reducer. </para>
<para>An HBase target table would need to exist for the job summary. The HTable method
<code>incrementColumnValue</code> would be used to atomically increment values. From a
          performance perspective, it might make sense to keep a Map of values with their counts to
          be incremented for each map-task, and make one update per key during the <code>
          cleanup</code> method of the mapper. However, your mileage may vary depending on the
number of rows to be processed and unique keys. </para>
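        <para>A sketch of this pattern follows. The table name <code>summary_table</code>, the
          column family, and the qualifier are examples only; adapt them to your own schema.</para>
        <programlisting>
public static class MySummaryMapper extends TableMapper&lt;Text, IntWritable&gt; {

  private Map&lt;String, Long&gt; counts = new HashMap&lt;String, Long&gt;();
  private HTable summaryTable;

  @Override
  protected void setup(Context context) throws IOException {
    summaryTable = new HTable(context.getConfiguration(), "summary_table");
  }

  @Override
  public void map(ImmutableBytesWritable row, Result value, Context context) {
    String key = Bytes.toString(value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr1")));
    Long current = counts.get(key);                  // count in memory, per map task
    counts.put(key, current == null ? 1L : current + 1);
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    // One atomic increment per distinct key, rather than one per row.
    for (Map.Entry&lt;String, Long&gt; entry : counts.entrySet()) {
      summaryTable.incrementColumnValue(Bytes.toBytes(entry.getKey()),
          Bytes.toBytes("cf"), Bytes.toBytes("count"), entry.getValue());
    }
    summaryTable.close();
  }
}
        </programlisting>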
<para>In the end, the summary results are in HBase. </para>
</section>
<section
xml:id="mapreduce.example.summary.rdbms">
<title>HBase MapReduce Summary to RDBMS</title>
<para>Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases,
it is possible to generate summaries directly to an RDBMS via a custom reducer. The
<code>setup</code> method can connect to an RDBMS (the connection information can be
passed via custom parameters in the context) and the cleanup method can close the
connection. </para>
        <para>It is critical to understand that the number of reducers for the job affects the
summarization implementation, and you'll have to design this into your reducer.
Specifically, whether it is designed to run as a singleton (one reducer) or multiple
reducers. Neither is right or wrong, it depends on your use-case. Recognize that the more
reducers that are assigned to the job, the more simultaneous connections to the RDBMS will
be created - this will scale, but only to a point. </para>
<programlisting>
public static class MyRdbmsReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; {
private Connection c = null;
@ -1025,18 +1189,18 @@ if (!b) {
}
</programlisting>
<para>In the end, the summary results are written to your RDBMS table/s.
</para>
</section>
<para>In the end, the summary results are written to your RDBMS table/s. </para>
</section>
</section> <!-- mr examples -->
<section xml:id="mapreduce.htable.access">
<title>Accessing Other HBase Tables in a MapReduce Job</title>
<para>Although the framework currently allows one HBase table as input to a
MapReduce job, other HBase tables can
be accessed as lookup tables, etc., in a
MapReduce job via creating an HTable instance in the setup method of the Mapper.
<programlisting>public class MyMapper extends TableMapper&lt;Text, LongWritable&gt; {
</section>
<!-- mr examples -->
<section
xml:id="mapreduce.htable.access">
<title>Accessing Other HBase Tables in a MapReduce Job</title>
<para>Although the framework currently allows one HBase table as input to a MapReduce job,
        other HBase tables can be accessed as lookup tables, etc., in a MapReduce job by creating
an HTable instance in the setup method of the Mapper.
<programlisting>public class MyMapper extends TableMapper&lt;Text, LongWritable&gt; {
private HTable myOtherTable;
public void setup(Context context) {
@ -1049,20 +1213,19 @@ if (!b) {
}
</programlisting>
</para>
</para>
</section>
<section
xml:id="mapreduce.specex">
<title>Speculative Execution</title>
<para>It is generally advisable to turn off speculative execution for MapReduce jobs that use
        HBase as a source. This can either be done on a per-Job basis through properties, or on the
entire cluster. Especially for longer running jobs, speculative execution will create
duplicate map-tasks which will double-write your data to HBase; this is probably not what
you want. </para>
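      <para>One way to do this on a per-job basis, assuming the MapReduce v2 property names, is to
        switch the properties off in the job configuration before submitting the job:</para>
      <programlisting>
// Sketch: disable speculative execution for a single job (MRv2 property names).
Configuration conf = HBaseConfiguration.create();
conf.setBoolean("mapreduce.map.speculative", false);
conf.setBoolean("mapreduce.reduce.speculative", false);
Job job = new Job(conf, "MyHBaseScanJob");   // "MyHBaseScanJob" is just an example name
      </programlisting>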
<para>See <xref
linkend="spec.ex" /> for more information. </para>
</section>
<section xml:id="mapreduce.specex">
<title>Speculative Execution</title>
<para>It is generally advisable to turn off speculative execution for
MapReduce jobs that use HBase as a source. This can either be done on a
per-Job basis through properties, or on the entire cluster. Especially
for longer running jobs, speculative execution will create duplicate
map-tasks which will double-write your data to HBase; this is probably
not what you want.
</para>
<para>See <xref linkend="spec.ex"/> for more information.
</para>
</section>
</chapter> <!-- mapreduce -->
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="security.xml" />