HBASE-8807 HBase MapReduce Job-Launch Documentation Misplaced (Misty Stanley-Jones)
parent d9739b9e3f
commit ef995efb1a
@@ -20,104 +20,7 @@
|
||||||
Provides HBase <a href="http://wiki.apache.org/hadoop/HadoopMapReduce">MapReduce</a>
|
Provides HBase <a href="http://wiki.apache.org/hadoop/HadoopMapReduce">MapReduce</a>
|
||||||
Input/OutputFormats, a table indexing MapReduce job, and utility
|
Input/OutputFormats, a table indexing MapReduce job, and utility
|
||||||
|
|
||||||
<h2>Table of Contents</h2>
|
<p>See <a href="http://hbase.apache.org/book.html#mapreduce">HBase and MapReduce</a>
|
||||||
<ul>
|
in the HBase Reference Guide for mapreduce over hbase documentation.
|
||||||
<li><a href="#classpath">HBase, MapReduce and the CLASSPATH</a></li>
|
|
||||||
<li><a href="#sink">HBase as MapReduce job data source and sink</a></li>
|
|
||||||
<li><a href="#examples">Example Code</a></li>
|
|
||||||
</ul>
|
|
||||||
|
|
||||||
<h2><a name="classpath">HBase, MapReduce and the CLASSPATH</a></h2>
|
|
||||||
|
|
||||||
<p>MapReduce jobs deployed to a MapReduce cluster do not by default have access
|
|
||||||
to the HBase configuration under <code>$HBASE_CONF_DIR</code> nor to HBase classes.
|
|
||||||
You could add <code>hbase-site.xml</code> to $HADOOP_HOME/conf and add
|
|
||||||
<code>hbase-X.X.X.jar</code> to the <code>$HADOOP_HOME/lib</code> and copy these
|
|
||||||
changes across your cluster but the cleanest means of adding hbase configuration
|
|
||||||
and classes to the cluster <code>CLASSPATH</code> is by uncommenting
|
|
||||||
<code>HADOOP_CLASSPATH</code> in <code>$HADOOP_HOME/conf/hadoop-env.sh</code>
|
|
||||||
adding hbase dependencies here. For example, here is how you would amend
|
|
||||||
<code>hadoop-env.sh</code> adding the
|
|
||||||
built hbase jar, zookeeper (needed by hbase client), hbase conf, and the
|
|
||||||
<code>PerformanceEvaluation</code> class from the built hbase test jar to the
|
|
||||||
hadoop <code>CLASSPATH</code>:
|
|
||||||
|
|
||||||
<blockquote><pre># Extra Java CLASSPATH elements. Optional.
|
|
||||||
# export HADOOP_CLASSPATH=
|
|
||||||
export HADOOP_CLASSPATH=$HBASE_HOME/build/hbase-X.X.X.jar:$HBASE_HOME/build/hbase-X.X.X-test.jar:$HBASE_HOME/conf:${HBASE_HOME}/lib/zookeeper-X.X.X.jar</pre></blockquote>
|
|
||||||
|
|
||||||
<p>Expand <code>$HBASE_HOME</code> in the above appropriately to suit your
|
|
||||||
local environment.</p>
|
|
||||||
|
|
||||||
<p>After copying the above change around your cluster (and restarting), this is
|
|
||||||
how you would run the PerformanceEvaluation MR job to put up 4 clients (Presumes
|
|
||||||
a ready mapreduce cluster):
|
|
||||||
|
|
||||||
<blockquote><pre>$HADOOP_HOME/bin/hadoop org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 4</pre></blockquote>
|
|
||||||
|
|
||||||
The PerformanceEvaluation class will be found on the CLASSPATH because you
|
|
||||||
added <code>$HBASE_HOME/build/test</code> to HADOOP_CLASSPATH
|
|
||||||
</p>
|
|
||||||
|
|
||||||
<p>Another possibility, if for example you do not have access to hadoop-env.sh or
|
|
||||||
are unable to restart the hadoop cluster, is bundling the hbase jar into a mapreduce
|
|
||||||
job jar adding it and its dependencies under the job jar <code>lib/</code>
|
|
||||||
directory and the hbase conf into a job jar <code>conf/</code> directory.
|
|
||||||
</p>
|
|
||||||
|
|
||||||
<h2><a name="sink">HBase as MapReduce job data source and sink</a></h2>
|
|
||||||
|
|
||||||
<p>HBase can be used as a data source, {@link org.apache.hadoop.hbase.mapred.TableInputFormat TableInputFormat},
|
|
||||||
and data sink, {@link org.apache.hadoop.hbase.mapred.TableOutputFormat TableOutputFormat}, for MapReduce jobs.
|
|
||||||
Writing MapReduce jobs that read or write HBase, you'll probably want to subclass
|
|
||||||
{@link org.apache.hadoop.hbase.mapred.TableMap TableMap} and/or
|
|
||||||
{@link org.apache.hadoop.hbase.mapred.TableReduce TableReduce}. See the do-nothing
|
|
||||||
pass-through classes {@link org.apache.hadoop.hbase.mapred.IdentityTableMap IdentityTableMap} and
|
|
||||||
{@link org.apache.hadoop.hbase.mapred.IdentityTableReduce IdentityTableReduce} for basic usage. For a more
|
|
||||||
involved example, see <code>BuildTableIndex</code>
|
|
||||||
or review the <code>org.apache.hadoop.hbase.mapred.TestTableMapReduce</code> unit test.
|
|
||||||
</p>
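<p>For orientation only, a minimal driver for this older
<code>org.apache.hadoop.hbase.mapred</code> API might look like the following sketch. The
table name <code>mytable</code>, the column family <code>info</code>, and the driver class
name are placeholders, and <code>NullOutputFormat</code> is used because the identity map
emits nothing that needs persisting.</p>
<blockquote><pre>
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapred.IdentityTableMap;
import org.apache.hadoop.hbase.mapred.TableMapReduceUtil;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

public class MyScanJob {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(HBaseConfiguration.create(), MyScanJob.class);
    job.setJobName("scan-mytable");
    // Scan the 'info' family of 'mytable'; IdentityTableMap passes each row straight through.
    TableMapReduceUtil.initTableMapJob("mytable", "info",
        IdentityTableMap.class, ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);                     // map-only job
    job.setOutputFormat(NullOutputFormat.class);  // nothing is written back out
    JobClient.runJob(job);
  }
}
</pre></blockquote>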
|
|
||||||
|
|
||||||
<p>Running mapreduce jobs that have hbase as source or sink, you'll need to
|
|
||||||
specify source/sink table and column names in your configuration.</p>
|
|
||||||
|
|
||||||
<p>Reading from hbase, the TableInputFormat asks hbase for the list of
|
|
||||||
regions and makes a map-per-region or <code>mapred.map.tasks</code> maps,
|
|
||||||
whichever is smaller (If your job only has two maps, up mapred.map.tasks
|
|
||||||
to a number > number of regions). Maps will run on the adjacent TaskTracker
|
|
||||||
if you are running a TaskTracker and RegionServer per node.
|
|
||||||
Writing, it may make sense to avoid the reduce step and write yourself back into
|
|
||||||
hbase from inside your map. You'd do this when your job does not need the sort
|
|
||||||
and collation that mapreduce does on the map emitted data; on insert,
|
|
||||||
hbase 'sorts' so there is no point double-sorting (and shuffling data around
|
|
||||||
your mapreduce cluster) unless you need to. If you do not need the reduce,
|
|
||||||
you might just have your map emit counts of records processed just so the
|
|
||||||
framework's report at the end of your job has meaning or set the number of
|
|
||||||
reduces to zero and use TableOutputFormat. See example code
|
|
||||||
below. If running the reduce step makes sense in your case, it's usually better
|
|
||||||
to have lots of reducers so load is spread across the hbase cluster.</p>
|
|
||||||
|
|
||||||
<p>There is also a new hbase partitioner that will run as many reducers as
|
|
||||||
currently existing regions. The
|
|
||||||
{@link org.apache.hadoop.hbase.mapred.HRegionPartitioner} is suitable
|
|
||||||
when your table is large and your upload is not such that it will greatly
|
|
||||||
alter the number of existing regions when done; otherwise use the default
|
|
||||||
partitioner.
|
|
||||||
</p>
|
|
||||||
|
|
||||||
<h2><a name="examples">Example Code</a></h2>
|
|
||||||
<h3>Sample Row Counter</h3>
|
|
||||||
<p>See {@link org.apache.hadoop.hbase.mapred.RowCounter}. You should be able to run
|
|
||||||
it by doing: <code>% ./bin/hadoop jar hbase-X.X.X.jar</code>. This will invoke
|
|
||||||
the hbase MapReduce Driver class. Select 'rowcounter' from the choice of jobs
|
|
||||||
offered. You may need to add the hbase conf directory to <code>$HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH</code>
|
|
||||||
so the rowcounter gets pointed at the right hbase cluster (or, build a new jar
|
|
||||||
with an appropriate hbase-site.xml built into your job jar).
|
|
||||||
</p>
|
|
||||||
<h3>PerformanceEvaluation</h3>
|
|
||||||
<p>See org.apache.hadoop.hbase.PerformanceEvaluation from hbase src/test. It runs
|
|
||||||
a mapreduce job to run concurrent clients reading and writing hbase.
|
|
||||||
</p>
|
|
||||||
|
|
||||||
*/
|
*/
|
||||||
package org.apache.hadoop.hbase.mapred;
|
package org.apache.hadoop.hbase.mapred;
|
||||||
|
|
|
@@ -20,144 +20,7 @@
|
||||||
Provides HBase <a href="http://wiki.apache.org/hadoop/HadoopMapReduce">MapReduce</a>
|
Provides HBase <a href="http://wiki.apache.org/hadoop/HadoopMapReduce">MapReduce</a>
|
||||||
Input/OutputFormats, a table indexing MapReduce job, and utility
|
Input/OutputFormats, a table indexing MapReduce job, and utility
|
||||||
|
|
||||||
<h2>Table of Contents</h2>
|
<p>See <a href="http://hbase.apache.org/book.html#mapreduce">HBase and MapReduce</a>
|
||||||
<ul>
|
in the HBase Reference Guide for mapreduce over hbase documentation.
|
||||||
<li><a href="#classpath">HBase, MapReduce and the CLASSPATH</a></li>
|
|
||||||
<li><a href="#driver">Bundled HBase MapReduce Jobs</a></li>
|
|
||||||
<li><a href="#sink">HBase as MapReduce job data source and sink</a></li>
|
|
||||||
<li><a href="#bulk">Bulk Import writing HFiles directly</a></li>
|
|
||||||
<li><a href="#examples">Example Code</a></li>
|
|
||||||
</ul>
|
|
||||||
|
|
||||||
<h2><a name="classpath">HBase, MapReduce and the CLASSPATH</a></h2>
|
|
||||||
|
|
||||||
<p>MapReduce jobs deployed to a MapReduce cluster do not by default have access
|
|
||||||
to the HBase configuration under <code>$HBASE_CONF_DIR</code> nor to HBase classes.
|
|
||||||
You could add <code>hbase-site.xml</code> to
|
|
||||||
<code>$HADOOP_HOME/conf</code> and add
|
|
||||||
HBase jars to the <code>$HADOOP_HOME/lib</code> and copy these
|
|
||||||
changes across your cluster (or edit conf/hadoop-env.sh and add them to the
|
|
||||||
<code>HADOOP_CLASSPATH</code> variable) but this will pollute your
|
|
||||||
hadoop install with HBase references; it's also obnoxious, requiring a restart of
|
|
||||||
the hadoop cluster before it'll notice your HBase additions.</p>
|
|
||||||
|
|
||||||
<p>As of 0.90.x, HBase will just add its dependency jars to the job
|
|
||||||
configuration; the dependencies just need to be available on the local
|
|
||||||
<code>CLASSPATH</code>. For example, to run the bundled HBase
|
|
||||||
{@link org.apache.hadoop.hbase.mapreduce.RowCounter} mapreduce job against a table named <code>usertable</code>,
|
|
||||||
type:
|
|
||||||
|
|
||||||
<blockquote><pre>
|
|
||||||
$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0.jar rowcounter usertable
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
Expand <code>$HBASE_HOME</code> and <code>$HADOOP_HOME</code> in the above
|
|
||||||
appropriately to suit your local environment. The content of <code>HADOOP_CLASSPATH</code>
|
|
||||||
is set to the HBase <code>CLASSPATH</code> via backticking the command
|
|
||||||
<code>${HBASE_HOME}/bin/hbase classpath</code>.
|
|
||||||
|
|
||||||
<p>When the above runs, internally, the HBase jar finds its zookeeper and
|
|
||||||
<a href="http://code.google.com/p/guava-libraries/">guava</a>,
|
|
||||||
etc., dependencies on the passed
|
|
||||||
<code>HADOOP_CLASSPATH</code> and adds the found jars to the mapreduce
|
|
||||||
job configuration. See the source at
|
|
||||||
<code>TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job)</code>
|
|
||||||
for how this is done.
|
|
||||||
</p>
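<p>As a rough illustration (the driver class name is a placeholder), the same shipping of
dependency jars can be requested explicitly from your own driver; the
<code>initTable*Job</code> helpers normally call it for you:</p>
<blockquote><pre>
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

Job job = new Job(HBaseConfiguration.create(), "my-hbase-job");
job.setJarByClass(MyDriver.class);   // placeholder driver class
// Adds the hbase, zookeeper, guava, etc. jars found on the local CLASSPATH
// to the job configuration so cluster-side tasks can see them.
TableMapReduceUtil.addDependencyJars(job);
</pre></blockquote>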
|
|
||||||
<p>The above may not work if you are running your HBase from its build directory;
|
|
||||||
i.e. you've done <code>$ mvn test install</code> at
|
|
||||||
<code>${HBASE_HOME}</code> and you are now
|
|
||||||
trying to use this build in your mapreduce job. If you get
|
|
||||||
<blockquote><pre>java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper
|
|
||||||
...
|
|
||||||
</pre></blockquote>
|
|
||||||
exception thrown, try doing the following:
|
|
||||||
<blockquote><pre>
|
|
||||||
$ HADOOP_CLASSPATH=${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar rowcounter usertable
|
|
||||||
</pre></blockquote>
|
|
||||||
Notice how we preface the backtick invocation setting
|
|
||||||
<code>HADOOP_CLASSPATH</code> with reference to the built HBase jar over in
|
|
||||||
the <code>target</code> directory.
|
|
||||||
</p>
|
|
||||||
|
|
||||||
<h2><a name="driver">Bundled HBase MapReduce Jobs</a></h2>
|
|
||||||
<p>The HBase jar also serves as a Driver for some bundled mapreduce jobs. To
|
|
||||||
learn about the bundled mapreduce jobs run:
|
|
||||||
<blockquote><pre>
|
|
||||||
$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar
|
|
||||||
An example program must be given as the first argument.
|
|
||||||
Valid program names are:
|
|
||||||
copytable: Export a table from local cluster to peer cluster
|
|
||||||
completebulkload: Complete a bulk data load.
|
|
||||||
export: Write table data to HDFS.
|
|
||||||
import: Import data written by Export.
|
|
||||||
importtsv: Import data in TSV format.
|
|
||||||
rowcounter: Count rows in HBase table
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
<h2><a name="sink">HBase as MapReduce job data source and sink</a></h2>
|
|
||||||
|
|
||||||
<p>HBase can be used as a data source, {@link org.apache.hadoop.hbase.mapreduce.TableInputFormat TableInputFormat},
|
|
||||||
and data sink, {@link org.apache.hadoop.hbase.mapreduce.TableOutputFormat TableOutputFormat}
|
|
||||||
or {@link org.apache.hadoop.hbase.mapreduce.MultiTableOutputFormat MultiTableOutputFormat},
|
|
||||||
for MapReduce jobs.
|
|
||||||
Writing MapReduce jobs that read or write HBase, you'll probably want to subclass
|
|
||||||
{@link org.apache.hadoop.hbase.mapreduce.TableMapper TableMapper} and/or
|
|
||||||
{@link org.apache.hadoop.hbase.mapreduce.TableReducer TableReducer}. See the do-nothing
|
|
||||||
pass-through classes {@link org.apache.hadoop.hbase.mapreduce.IdentityTableMapper IdentityTableMapper} and
|
|
||||||
{@link org.apache.hadoop.hbase.mapreduce.IdentityTableReducer IdentityTableReducer} for basic usage. For a more
|
|
||||||
involved example, see {@link org.apache.hadoop.hbase.mapreduce.RowCounter}
|
|
||||||
or review the <code>org.apache.hadoop.hbase.mapreduce.TestTableMapReduce</code> unit test.
|
|
||||||
</p>
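<p>A minimal wiring sketch, assuming a table named <code>mytable</code> and a placeholder
driver class, that maps over the table with the do-nothing mapper:</p>
<blockquote><pre>
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableMapper;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

Job job = new Job(HBaseConfiguration.create(), "identity-scan");
job.setJarByClass(MyDriver.class);   // placeholder driver class
Scan scan = new Scan();
scan.setCaching(500);        // a larger scanner cache helps MapReduce scans
scan.setCacheBlocks(false);  // don't fill the block cache from a full-table scan
TableMapReduceUtil.initTableMapperJob("mytable", scan,
    IdentityTableMapper.class, ImmutableBytesWritable.class, Result.class, job);
job.setNumReduceTasks(0);
job.setOutputFormatClass(NullOutputFormat.class);
job.waitForCompletion(true);
</pre></blockquote>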
|
|
||||||
|
|
||||||
<p>Running mapreduce jobs that have HBase as source or sink, you'll need to
|
|
||||||
specify source/sink table and column names in your configuration.</p>
|
|
||||||
|
|
||||||
<p>Reading from HBase, the TableInputFormat asks HBase for the list of
|
|
||||||
regions and makes a map-per-region or <code>mapreduce.job.maps</code> maps,
|
|
||||||
whichever is smaller (If your job only has two maps, up mapreduce.job.maps
|
|
||||||
to a number > number of regions). Maps will run on the adjacent TaskTracker
|
|
||||||
if you are running a TaskTracker and RegionServer per node.
|
|
||||||
Writing, it may make sense to avoid the reduce step and write yourself back into
|
|
||||||
HBase from inside your map. You'd do this when your job does not need the sort
|
|
||||||
and collation that mapreduce does on the map emitted data; on insert,
|
|
||||||
HBase 'sorts' so there is no point double-sorting (and shuffling data around
|
|
||||||
your mapreduce cluster) unless you need to. If you do not need the reduce,
|
|
||||||
you might just have your map emit counts of records processed just so the
|
|
||||||
framework's report at the end of your job has meaning or set the number of
|
|
||||||
reduces to zero and use TableOutputFormat. See example code
|
|
||||||
below. If running the reduce step makes sense in your case, it's usually better
|
|
||||||
to have lots of reducers so load is spread across the HBase cluster.</p>
|
|
||||||
|
|
||||||
<p>There is also a new HBase partitioner that will run as many reducers as
|
|
||||||
currently existing regions. The
|
|
||||||
{@link org.apache.hadoop.hbase.mapreduce.HRegionPartitioner} is suitable
|
|
||||||
when your table is large and your upload is not such that it will greatly
|
|
||||||
alter the number of existing regions when done; otherwise use the default
|
|
||||||
partitioner.
|
|
||||||
</p>
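<p>A rough sketch of selecting this partitioner through the reducer-job helper
(the table and reducer names are placeholders):</p>
<blockquote><pre>
import org.apache.hadoop.hbase.mapreduce.HRegionPartitioner;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;

// Partition reducer input by target region so each reducer writes to one region.
TableMapReduceUtil.initTableReducerJob("targettable", MyTableReducer.class, job,
    HRegionPartitioner.class);
</pre></blockquote>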
|
|
||||||
|
|
||||||
<h2><a name="bulk">Bulk import writing HFiles directly</a></h2>
|
|
||||||
<p>If importing into a new table, it's possible to bypass the HBase API
|
|
||||||
and write your content directly to the filesystem properly formatted as
|
|
||||||
HBase data files (HFiles). Your import will run faster, perhaps an order of
|
|
||||||
magnitude faster if not more. For more on how this mechanism works, see
|
|
||||||
<a href="http://hbase.apache.org/bulk-loads.html">Bulk Loads</code>
|
|
||||||
documentation.
|
|
||||||
</p>
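<p>A rough sketch of such a job, assuming the target table already exists and that the
driver, mapper, and paths are placeholders; the generated files are then handed to the
<code>completebulkload</code> tool:</p>
<blockquote><pre>
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Job job = new Job(HBaseConfiguration.create(), "bulk-import");
job.setJarByClass(MyBulkImportDriver.class);  // placeholder driver class
// ... configure a mapper that emits (ImmutableBytesWritable, Put) pairs ...
HTable table = new HTable(job.getConfiguration(), "targettable");
// Sets up total ordering and the reducer so output HFiles line up with the table's regions.
HFileOutputFormat.configureIncrementalLoad(job, table);
FileOutputFormat.setOutputPath(job, new Path("/tmp/bulk-output"));
job.waitForCompletion(true);
</pre></blockquote>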
|
|
||||||
|
|
||||||
<h2><a name="examples">Example Code</a></h2>
|
|
||||||
<h3>Sample Row Counter</h3>
|
|
||||||
<p>See {@link org.apache.hadoop.hbase.mapreduce.RowCounter}. This job uses
|
|
||||||
{@link org.apache.hadoop.hbase.mapreduce.TableInputFormat TableInputFormat} and
|
|
||||||
does a count of all rows in specified table.
|
|
||||||
You should be able to run
|
|
||||||
it by doing: <code>% ./bin/hadoop jar hbase-X.X.X.jar</code>. This will invoke
|
|
||||||
the hbase MapReduce Driver class. Select 'rowcounter' from the choice of jobs
|
|
||||||
offered. This will emit rowcounter 'usage'. Specify the tablename, column to count
|
|
||||||
and output directory. You may need to add the hbase conf directory to <code>$HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH</code>
|
|
||||||
so the rowcounter gets pointed at the right hbase cluster (or, build a new jar
|
|
||||||
with an appropriate hbase-site.xml built into your job jar).
|
|
||||||
</p>
|
|
||||||
*/
|
*/
|
||||||
package org.apache.hadoop.hbase.mapreduce;
|
package org.apache.hadoop.hbase.mapreduce;
|
||||||
|
|
|
@@ -664,18 +664,76 @@ htable.put(put);
|
||||||
<!-- schema design -->
|
<!-- schema design -->
|
||||||
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="schema_design.xml"/>
|
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="schema_design.xml"/>
|
||||||
|
|
||||||
<chapter xml:id="mapreduce">
|
<chapter
|
||||||
<title>HBase and MapReduce</title>
|
xml:id="mapreduce">
|
||||||
<para>See <link xlink:href="http://hbase.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description">
|
<title>HBase and MapReduce</title>
|
||||||
HBase and MapReduce</link> up in javadocs.
|
<para>Apache MapReduce is a software framework used to analyze large amounts of data, and is
|
||||||
Start there. Below is some additional help.</para>
|
the framework used most often with <link
|
||||||
<para>For more information about MapReduce (i.e., the framework in general), see the Hadoop site (TODO: Need good links here --
|
xlink:href="http://hadoop.apache.org/">Apache Hadoop</link>. MapReduce itself is out of the
|
||||||
we used to have some but they rotted against apache hadoop).</para>
|
scope of this document. A good place to get started with MapReduce is <link
|
||||||
<caution>
|
xlink:href="http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html" />. MapReduce version
|
||||||
<title>Notice to Mapreduce users of HBase 0.96.1 and above</title>
|
2 (MR2) is now part of <link
|
||||||
<para>Some mapreduce jobs that use HBase fail to launch. The symptom is an
|
xlink:href="http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/">YARN</link>. </para>
|
||||||
exception similar to the following:
|
|
||||||
<programlisting>
|
<para> This chapter discusses specific configuration steps you need to take to use MapReduce on
|
||||||
|
data within HBase. In addition, it discusses other interactions and issues between HBase and
|
||||||
|
MapReduce jobs.
|
||||||
|
<note>
|
||||||
|
<title>mapred and mapreduce</title>
|
||||||
|
<para>There are two mapreduce packages in HBase as in MapReduce itself: <filename>org.apache.hadoop.hbase.mapred</filename>
|
||||||
|
and <filename>org.apache.hadoop.hbase.mapreduce</filename>. The former does old-style API and the latter
|
||||||
|
the new style. The latter has more facility though you can usually find an equivalent in the older
|
||||||
|
package. Pick the package that goes with your mapreduce deploy. When in doubt or starting over, pick the
|
||||||
|
<filename>org.apache.hadoop.hbase.mapreduce</filename>. In the notes below, we refer to
|
||||||
|
o.a.h.h.mapreduce but replace with the o.a.h.h.mapred if that is what you are using.
|
||||||
|
</para>
|
||||||
|
</note>
|
||||||
|
</para>
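<para>For instance, the equivalent mapper and reducer roles live in different packages
depending on which API generation you target (an illustrative fragment, not a complete
program):</para>
<programlisting>
// Old-style ("mapred") API, layered on org.apache.hadoop.mapred
import org.apache.hadoop.hbase.mapred.TableMap;
import org.apache.hadoop.hbase.mapred.TableReduce;

// New-style ("mapreduce") API, layered on org.apache.hadoop.mapreduce
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
</programlisting>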
|
||||||
|
|
||||||
|
<section
|
||||||
|
xml:id="hbase.mapreduce.classpath">
|
||||||
|
<title>HBase, MapReduce, and the CLASSPATH</title>
|
||||||
|
<para>By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either
|
||||||
|
the HBase configuration under <envar>$HBASE_CONF_DIR</envar> or the HBase classes.</para>
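<para>For example, client code that builds its own <code>Configuration</code> only sees
<filename>hbase-site.xml</filename> if that file is on its classpath or is added
explicitly; the path below is an assumption for illustration.</para>
<programlisting>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Loads hbase-default.xml plus any hbase-site.xml found on the CLASSPATH.
Configuration conf = HBaseConfiguration.create();
// If hbase-site.xml is not on the CLASSPATH, add it explicitly (path is illustrative).
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));
</programlisting>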
|
||||||
|
<para>To give the MapReduce jobs the access they need, you could add
<filename>hbase-site.xml</filename> to the
<filename><replaceable>$HADOOP_HOME</replaceable>/conf/</filename> directory and add the
HBase JARs to the <filename><replaceable>$HADOOP_HOME</replaceable>/lib/</filename>
directory, then copy these changes across your cluster, or you could edit
<filename><replaceable>$HADOOP_HOME</replaceable>/conf/hadoop-env.sh</filename> and add
them to the <envar>HADOOP_CLASSPATH</envar> variable. However, this approach is not
recommended because it will pollute your Hadoop install with HBase references. It also
requires you to restart the Hadoop cluster before Hadoop can use the HBase data.</para>
|
||||||
|
<para> Since HBase 0.90.x, HBase adds its dependency JARs to the job configuration itself. The
|
||||||
|
dependencies only need to be available on the local CLASSPATH. The following example runs
|
||||||
|
the bundled HBase <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
|
||||||
|
MapReduce job against a table named <systemitem>usertable</systemitem>. If you have not set
|
||||||
|
the environment variables expected in the command (the parts prefixed by a
|
||||||
|
<literal>$</literal> sign and curly braces), you can use the actual system paths instead.
|
||||||
|
Be sure to use the correct version of the HBase JAR for your system. The backticks
|
||||||
|
(<literal>`</literal> symbols) cause the shell to execute the sub-commands, setting the
|
||||||
|
CLASSPATH as part of the command. This example assumes you use a BASH-compatible shell. </para>
|
||||||
|
<screen>$ <userinput>HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0.jar rowcounter usertable</userinput></screen>
|
||||||
|
<para>When the command runs, internally, the HBase JAR finds the dependencies it needs
(ZooKeeper, Guava, and so on) on the passed <envar>HADOOP_CLASSPATH</envar>
|
||||||
|
and adds the JARs to the MapReduce job configuration. See the source at
|
||||||
|
TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done. </para>
|
||||||
|
<note>
|
||||||
|
<para> The example may not work if you are running HBase from its build directory rather
|
||||||
|
than an installed location. You may see an error like the following:</para>
|
||||||
|
<screen>java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper</screen>
|
||||||
|
<para>If this occurs, try modifying the command as follows, so that it uses the HBase JARs
|
||||||
|
from the <filename>target/</filename> directory within the build environment.</para>
|
||||||
|
<screen>$ <userinput>HADOOP_CLASSPATH=${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar rowcounter usertable</userinput></screen>
|
||||||
|
</note>
|
||||||
|
<caution>
|
||||||
|
<title>Notice to Mapreduce users of HBase 0.96.1 and above</title>
|
||||||
|
<para>Some mapreduce jobs that use HBase fail to launch. The symptom is an exception similar
|
||||||
|
to the following:</para>
|
||||||
|
<screen>
|
||||||
Exception in thread "main" java.lang.IllegalAccessError: class
|
Exception in thread "main" java.lang.IllegalAccessError: class
|
||||||
com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass
|
com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass
|
||||||
com.google.protobuf.LiteralByteString
|
com.google.protobuf.LiteralByteString
|
||||||
|
@@ -703,63 +761,158 @@ Exception in thread "main" java.lang.IllegalAccessError: class
|
||||||
at
|
at
|
||||||
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100)
|
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100)
|
||||||
...
|
...
|
||||||
</programlisting>
|
</screen>
|
||||||
This is because of an optimization introduced in <link
|
<para>This is caused by an optimization introduced in <link
|
||||||
xlink:href="https://issues.apache.org/jira/browse/HBASE-9867">HBASE-9867</link>
|
xlink:href="https://issues.apache.org/jira/browse/HBASE-9867">HBASE-9867</link> that
|
||||||
that inadvertently introduced a classloader dependency.
|
inadvertently introduced a classloader dependency. </para>
|
||||||
</para>
|
<para>This affects both jobs using the <code>-libjars</code> option and "fat jar," those
|
||||||
<para>This affects both jobs using the <code>-libjars</code> option and
|
which package their runtime dependencies in a nested <code>lib</code> folder.</para>
|
||||||
"fat jar," those which package their runtime dependencies in a nested
|
<para>In order to satisfy the new classloader requirements, hbase-protocol.jar must be
|
||||||
<code>lib</code> folder.</para>
|
included in Hadoop's classpath. See <xref
|
||||||
<para>In order to satisfy the new classloader requirements,
|
linkend="hbase.mapreduce.classpath" /> for current recommendations for resolving
|
||||||
hbase-protocol.jar must be included in Hadoop's classpath. This can be
|
classpath errors. The following is included for historical purposes.</para>
|
||||||
resolved system-wide by including a reference to the hbase-protocol.jar in
|
<para>This can be resolved system-wide by including a reference to the hbase-protocol.jar in
|
||||||
hadoop's lib directory, via a symlink or by copying the jar into the new
|
hadoop's lib directory, via a symlink or by copying the jar into the new location.</para>
|
||||||
location.</para>
|
<para>This can also be achieved on a per-job launch basis by including it in the
|
||||||
<para>This can also be achieved on a per-job launch basis by including it
|
<code>HADOOP_CLASSPATH</code> environment variable at job submission time. When
|
||||||
in the <code>HADOOP_CLASSPATH</code> environment variable at job submission
|
launching jobs that package their dependencies, all three of the following job launching
|
||||||
time. When launching jobs that package their dependencies, all three of the
|
commands satisfy this requirement:</para>
|
||||||
following job launching commands satisfy this requirement:</para>
|
<screen>
|
||||||
<programlisting>
|
$ <userinput>HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass</userinput>
|
||||||
$ HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
|
$ <userinput>HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass</userinput>
|
||||||
$ HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
|
$ <userinput>HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass</userinput>
|
||||||
$ HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass
|
</screen>
|
||||||
</programlisting>
|
<para>For jars that do not package their dependencies, the following command structure is
|
||||||
<para>For jars that do not package their dependencies, the following command
|
necessary:</para>
|
||||||
structure is necessary:</para>
|
<screen>
|
||||||
<programlisting>
|
$ <userinput>HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',')</userinput> ...
|
||||||
$ HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',') ...
|
</screen>
|
||||||
</programlisting>
|
<para>See also <link
|
||||||
<para>See also <link
|
xlink:href="https://issues.apache.org/jira/browse/HBASE-10304">HBASE-10304</link> for
|
||||||
xlink:href="https://issues.apache.org/jira/browse/HBASE-10304">HBASE-10304</link>
|
further discussion of this issue.</para>
|
||||||
for further discussion of this issue.</para>
|
</caution>
|
||||||
</caution>
|
|
||||||
<section xml:id="splitter">
|
|
||||||
<title>Map-Task Splitting</title>
|
|
||||||
<section xml:id="splitter.default">
|
|
||||||
<title>The Default HBase MapReduce Splitter</title>
|
|
||||||
<para>When <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>
|
|
||||||
is used to source an HBase table in a MapReduce job,
|
|
||||||
its splitter will make a map task for each region of the table.
|
|
||||||
Thus, if there are 100 regions in the table, there will be
|
|
||||||
100 map-tasks for the job - regardless of how many column families are selected in the Scan.</para>
|
|
||||||
</section>
|
</section>
|
||||||
<section xml:id="splitter.custom">
|
|
||||||
<title>Custom Splitters</title>
|
|
||||||
<para>For those interested in implementing custom splitters, see the method <code>getSplits</code> in
|
<section>
|
||||||
<link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.html">TableInputFormatBase</link>.
|
<title>Bundled HBase MapReduce Jobs</title>
|
||||||
That is where the logic for map-task assignment resides.
|
<para>The HBase JAR also serves as a Driver for some bundled mapreduce jobs. To learn about
|
||||||
</para>
|
the bundled MapReduce jobs, run the following command.</para>
|
||||||
|
|
||||||
|
<screen>$ <userinput>${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar</userinput>
|
||||||
|
<computeroutput>An example program must be given as the first argument.
|
||||||
|
Valid program names are:
|
||||||
|
copytable: Export a table from local cluster to peer cluster
|
||||||
|
completebulkload: Complete a bulk data load.
|
||||||
|
export: Write table data to HDFS.
|
||||||
|
import: Import data written by Export.
|
||||||
|
importtsv: Import data in TSV format.
|
||||||
|
rowcounter: Count rows in HBase table</computeroutput>
|
||||||
|
</screen>
|
||||||
|
<para>Each of the valid program names are bundled MapReduce jobs. To run one of the jobs,
|
||||||
|
model your command after the following example.</para>
|
||||||
|
<screen>$ <userinput>${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar rowcounter myTable</userinput></screen>
|
||||||
</section>
|
</section>
|
||||||
</section>
|
|
||||||
<section xml:id="mapreduce.example">
|
<section>
|
||||||
<title>HBase MapReduce Examples</title>
|
<title>HBase as a MapReduce Job Data Source and Data Sink</title>
|
||||||
<section xml:id="mapreduce.example.read">
|
<para>HBase can be used as a data source, <link
|
||||||
<title>HBase MapReduce Read Example</title>
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>,
|
||||||
<para>The following is an example of using HBase as a MapReduce source in read-only manner. Specifically,
|
and data sink, <link
|
||||||
there is a Mapper instance but no Reducer, and nothing is being emitted from the Mapper. There job would be defined
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>
|
||||||
as follows...
|
or <link
|
||||||
<programlisting>
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.html">MultiTableOutputFormat</link>,
|
||||||
|
for MapReduce jobs. Writing MapReduce jobs that read or write HBase, it is advisable to
|
||||||
|
subclass <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>
|
||||||
|
and/or <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableReducer.html">TableReducer</link>.
|
||||||
|
See the do-nothing pass-through classes <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableMapper.html">IdentityTableMapper</link>
|
||||||
|
and <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableReducer.html">IdentityTableReducer</link>
|
||||||
|
for basic usage. For a more involved example, see <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
|
||||||
|
or review the <code>org.apache.hadoop.hbase.mapreduce.TestTableMapReduce</code> unit test. </para>
|
||||||
|
<para>If you run MapReduce jobs that use HBase as source or sink, you need to specify source and
|
||||||
|
sink table and column names in your configuration.</para>
|
||||||
|
|
||||||
|
<para>When you read from HBase, the <code>TableInputFormat</code> requests the list of regions
|
||||||
|
from HBase and makes one map task per region, or <code>mapreduce.job.maps</code> map
tasks, whichever is smaller. If your job only has two maps,
|
||||||
|
raise <code>mapreduce.job.maps</code> to a number greater than the number of regions. Maps
|
||||||
|
will run on the adjacent TaskTracker if you are running a TaskTracker and RegionServer per
|
||||||
|
node. When writing to HBase, it may make sense to avoid the Reduce step and write back into
|
||||||
|
HBase from within your map. This approach works when your job does not need the sort and
|
||||||
|
collation that MapReduce does on the map-emitted data. On insert, HBase 'sorts' so there is
|
||||||
|
no point double-sorting (and shuffling data around your MapReduce cluster) unless you need
|
||||||
|
to. If you do not need the Reduce, your map might emit counts of records processed for
reporting at the end of the job, or set the number of Reduces to zero and use
|
||||||
|
TableOutputFormat. If running the Reduce step makes sense in your case, you should typically
|
||||||
|
use multiple reducers so that load is spread across the HBase cluster.</para>
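<para>A rough sketch of the zero-reducer pattern just described (table, mapper, and driver
names are placeholders); it mirrors the read/write example later in this chapter.</para>
<programlisting>
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

Job job = new Job(HBaseConfiguration.create(), "map-only-write");
job.setJarByClass(MyMapOnlyJob.class);   // placeholder driver class
TableMapReduceUtil.initTableMapperJob("sourcetable", new Scan(),
    MyMapper.class, ImmutableBytesWritable.class, Put.class, job);
// Passing null as the reducer class configures TableOutputFormat for the target table.
TableMapReduceUtil.initTableReducerJob("targettable", null, job);
job.setNumReduceTasks(0);   // the mapper's Puts go straight to 'targettable'
job.waitForCompletion(true);
</programlisting>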
|
||||||
|
|
||||||
|
<para>A new HBase partitioner, the <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/HRegionPartitioner.html">HRegionPartitioner</link>,
|
||||||
|
can run as many reducers as there are existing regions. The HRegionPartitioner is suitable
|
||||||
|
when your table is large and your upload will not greatly alter the number of existing
|
||||||
|
regions upon completion. Otherwise use the default partitioner. </para>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>Writing HFiles Directly During Bulk Import</title>
|
||||||
|
<para>If you are importing into a new table, you can bypass the HBase API and write your
|
||||||
|
content directly to the filesystem, formatted into HBase data files (HFiles). Your import
|
||||||
|
will run faster, perhaps an order of magnitude faster. For more on how this mechanism works,
|
||||||
|
see <xref
|
||||||
|
linkend="arch.bulk.load" />.</para>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>RowCounter Example</title>
|
||||||
|
<para>The included <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
|
||||||
|
MapReduce job uses <code>TableInputFormat</code> and does a count of all rows in the specified
|
||||||
|
table. To run it, use the following command: </para>
|
||||||
|
<screen>$ <userinput>./bin/hadoop jar hbase-X.X.X.jar</userinput></screen>
|
||||||
|
<para>This will
|
||||||
|
invoke the HBase MapReduce Driver class. Select <literal>rowcounter</literal> from the choice of jobs
|
||||||
|
offered. This will print rowcounter usage advice to standard output. Specify the tablename,
|
||||||
|
column to count, and output
|
||||||
|
directory. If you have classpath errors, see <xref linkend="hbase.mapreduce.classpath" />.</para>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section
|
||||||
|
xml:id="splitter">
|
||||||
|
<title>Map-Task Splitting</title>
|
||||||
|
<section
|
||||||
|
xml:id="splitter.default">
|
||||||
|
<title>The Default HBase MapReduce Splitter</title>
|
||||||
|
<para>When <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>
|
||||||
|
is used to source an HBase table in a MapReduce job, its splitter will make a map task for
|
||||||
|
each region of the table. Thus, if there are 100 regions in the table, there will be 100
|
||||||
|
map-tasks for the job - regardless of how many column families are selected in the
|
||||||
|
Scan.</para>
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="splitter.custom">
|
||||||
|
<title>Custom Splitters</title>
|
||||||
|
<para>For those interested in implementing custom splitters, see the method
|
||||||
|
<code>getSplits</code> in <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.html">TableInputFormatBase</link>.
|
||||||
|
That is where the logic for map-task assignment resides. </para>
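<para>A rough sketch of a custom splitter that post-processes the default
one-split-per-region list (the class name and selection logic are illustrative only):</para>
<programlisting>
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

public class SampledTableInputFormat extends TableInputFormat {
  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    List<InputSplit> regionSplits = super.getSplits(context); // one split per region
    List<InputSplit> keep = new ArrayList<InputSplit>();
    for (int i = 0; i < regionSplits.size(); i += 2) {
      keep.add(regionSplits.get(i));  // illustrative: sample every other region
    }
    return keep;
  }
}
</programlisting>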
|
||||||
|
</section>
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="mapreduce.example">
|
||||||
|
<title>HBase MapReduce Examples</title>
|
||||||
|
<section
|
||||||
|
xml:id="mapreduce.example.read">
|
||||||
|
<title>HBase MapReduce Read Example</title>
|
||||||
|
<para>The following is an example of using HBase as a MapReduce source in a read-only manner.
|
||||||
|
Specifically, there is a Mapper instance but no Reducer, and nothing is being emitted from
|
||||||
|
the Mapper. The job would be defined as follows...</para>
|
||||||
|
<programlisting>
|
||||||
Configuration config = HBaseConfiguration.create();
|
Configuration config = HBaseConfiguration.create();
|
||||||
Job job = new Job(config, "ExampleRead");
|
Job job = new Job(config, "ExampleRead");
|
||||||
job.setJarByClass(MyReadJob.class); // class that contains mapper
|
job.setJarByClass(MyReadJob.class); // class that contains mapper
|
||||||
|
@@ -784,8 +937,9 @@ if (!b) {
|
||||||
throw new IOException("error with job!");
|
throw new IOException("error with job!");
|
||||||
}
|
}
|
||||||
</programlisting>
|
</programlisting>
|
||||||
...and the mapper instance would extend <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>...
|
<para>...and the mapper instance would extend <link
|
||||||
<programlisting>
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>...</para>
|
||||||
|
<programlisting>
|
||||||
public static class MyMapper extends TableMapper<Text, Text> {
|
public static class MyMapper extends TableMapper<Text, Text> {
|
||||||
|
|
||||||
public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
|
public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
|
||||||
|
@@ -793,13 +947,13 @@ public static class MyMapper extends TableMapper<Text, Text> {
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
</programlisting>
|
</programlisting>
|
||||||
</para>
|
</section>
|
||||||
</section>
|
<section
|
||||||
<section xml:id="mapreduce.example.readwrite">
|
xml:id="mapreduce.example.readwrite">
|
||||||
<title>HBase MapReduce Read/Write Example</title>
|
<title>HBase MapReduce Read/Write Example</title>
|
||||||
<para>The following is an example of using HBase both as a source and as a sink with MapReduce.
|
<para>The following is an example of using HBase both as a source and as a sink with
|
||||||
This example will simply copy data from one table to another.</para>
|
MapReduce. This example will simply copy data from one table to another.</para>
|
||||||
<programlisting>
|
<programlisting>
|
||||||
Configuration config = HBaseConfiguration.create();
|
Configuration config = HBaseConfiguration.create();
|
||||||
Job job = new Job(config,"ExampleReadWrite");
|
Job job = new Job(config,"ExampleReadWrite");
|
||||||
job.setJarByClass(MyReadWriteJob.class); // class that contains mapper
|
job.setJarByClass(MyReadWriteJob.class); // class that contains mapper
|
||||||
|
@@ -827,15 +981,18 @@ if (!b) {
|
||||||
throw new IOException("error with job!");
|
throw new IOException("error with job!");
|
||||||
}
|
}
|
||||||
</programlisting>
|
</programlisting>
|
||||||
<para>An explanation is required of what <classname>TableMapReduceUtil</classname> is doing, especially with the reducer.
|
<para>An explanation is required of what <classname>TableMapReduceUtil</classname> is doing,
|
||||||
<link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link> is being used
|
especially with the reducer. <link
|
||||||
as the outputFormat class, and several parameters are being set on the config (e.g., TableOutputFormat.OUTPUT_TABLE), as
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>
|
||||||
well as setting the reducer output key to <classname>ImmutableBytesWritable</classname> and reducer value to <classname>Writable</classname>.
|
is being used as the outputFormat class, and several parameters are being set on the
|
||||||
These could be set by the programmer on the job and conf, but <classname>TableMapReduceUtil</classname> tries to make things easier.</para>
|
config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key
|
||||||
<para>The following is the example mapper, which will create a <classname>Put</classname> and matching the input <classname>Result</classname>
|
to <classname>ImmutableBytesWritable</classname> and reducer value to
|
||||||
and emit it. Note: this is what the CopyTable utility does.
|
<classname>Writable</classname>. These could be set by the programmer on the job and
|
||||||
</para>
|
conf, but <classname>TableMapReduceUtil</classname> tries to make things easier.</para>
|
||||||
<programlisting>
|
<para>The following is the example mapper, which will create a <classname>Put</classname>
|
||||||
|
and matching the input <classname>Result</classname> and emit it. Note: this is what the
|
||||||
|
CopyTable utility does. </para>
|
||||||
|
<programlisting>
|
||||||
public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {
|
public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {
|
||||||
|
|
||||||
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
|
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
|
||||||
|
@@ -852,23 +1009,24 @@ public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put&
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
</programlisting>
|
</programlisting>
|
||||||
<para>There isn't actually a reducer step, so <classname>TableOutputFormat</classname> takes care of sending the <classname>Put</classname>
|
<para>There isn't actually a reducer step, so <classname>TableOutputFormat</classname> takes
|
||||||
to the target table.
|
care of sending the <classname>Put</classname> to the target table. </para>
|
||||||
</para>
|
<para>This is just an example, developers could choose not to use
|
||||||
<para>This is just an example, developers could choose not to use <classname>TableOutputFormat</classname> and connect to the
|
<classname>TableOutputFormat</classname> and connect to the target table themselves.
|
||||||
target table themselves.
|
</para>
|
||||||
</para>
|
</section>
|
||||||
</section>
|
<section
|
||||||
<section xml:id="mapreduce.example.readwrite.multi">
|
xml:id="mapreduce.example.readwrite.multi">
|
||||||
<title>HBase MapReduce Read/Write Example With Multi-Table Output</title>
|
<title>HBase MapReduce Read/Write Example With Multi-Table Output</title>
|
||||||
<para>TODO: example for <classname>MultiTableOutputFormat</classname>.
|
<para>TODO: example for <classname>MultiTableOutputFormat</classname>. </para>
|
||||||
</para>
|
</section>
|
||||||
</section>
|
<section
|
||||||
<section xml:id="mapreduce.example.summary">
|
xml:id="mapreduce.example.summary">
|
||||||
<title>HBase MapReduce Summary to HBase Example</title>
|
<title>HBase MapReduce Summary to HBase Example</title>
|
||||||
<para>The following example uses HBase as a MapReduce source and sink with a summarization step. This example will
|
<para>The following example uses HBase as a MapReduce source and sink with a summarization
|
||||||
count the number of distinct instances of a value in a table and write those summarized counts in another table.
|
step. This example will count the number of distinct instances of a value in a table and
|
||||||
<programlisting>
|
write those summarized counts in another table.
|
||||||
|
<programlisting>
|
||||||
Configuration config = HBaseConfiguration.create();
|
Configuration config = HBaseConfiguration.create();
|
||||||
Job job = new Job(config,"ExampleSummary");
|
Job job = new Job(config,"ExampleSummary");
|
||||||
job.setJarByClass(MySummaryJob.class); // class that contains mapper and reducer
|
job.setJarByClass(MySummaryJob.class); // class that contains mapper and reducer
|
||||||
|
@@ -896,9 +1054,10 @@ if (!b) {
|
||||||
throw new IOException("error with job!");
|
throw new IOException("error with job!");
|
||||||
}
|
}
|
||||||
</programlisting>
|
</programlisting>
|
||||||
In this example mapper a column with a String-value is chosen as the value to summarize upon.
|
In this example mapper a column with a String-value is chosen as the value to summarize
|
||||||
This value is used as the key to emit from the mapper, and an <classname>IntWritable</classname> represents an instance counter.
|
upon. This value is used as the key to emit from the mapper, and an
|
||||||
<programlisting>
|
<classname>IntWritable</classname> represents an instance counter.
|
||||||
|
<programlisting>
|
||||||
public static class MyMapper extends TableMapper<Text, IntWritable> {
|
public static class MyMapper extends TableMapper<Text, IntWritable> {
|
||||||
public static final byte[] CF = "cf".getBytes();
|
public static final byte[] CF = "cf".getBytes();
|
||||||
public static final byte[] ATTR1 = "attr1".getBytes();
|
public static final byte[] ATTR1 = "attr1".getBytes();
|
||||||
|
@@ -914,8 +1073,9 @@ public static class MyMapper extends TableMapper<Text, IntWritable> {
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
</programlisting>
|
</programlisting>
|
||||||
In the reducer, the "ones" are counted (just like any other MR example that does this), and then emits a <classname>Put</classname>.
|
In the reducer, the "ones" are counted (just like any other MR example that does this),
|
||||||
<programlisting>
|
and then emits a <classname>Put</classname>.
|
||||||
|
<programlisting>
|
||||||
public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
|
public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
|
||||||
public static final byte[] CF = "cf".getBytes();
|
public static final byte[] CF = "cf".getBytes();
|
||||||
public static final byte[] COUNT = "count".getBytes();
|
public static final byte[] COUNT = "count".getBytes();
|
||||||
|
@@ -932,14 +1092,15 @@ public static class MyTableReducer extends TableReducer<Text, IntWritable, Im
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
</programlisting>
|
</programlisting>
|
||||||
</para>
|
</para>
|
||||||
</section>
|
</section>
|
||||||
<section xml:id="mapreduce.example.summary.file">
|
<section
|
||||||
<title>HBase MapReduce Summary to File Example</title>
|
xml:id="mapreduce.example.summary.file">
|
||||||
<para>This very similar to the summary example above, with exception that this is using HBase as a MapReduce source
|
<title>HBase MapReduce Summary to File Example</title>
|
||||||
but HDFS as the sink. The differences are in the job setup and in the reducer. The mapper remains the same.
|
<para>This is very similar to the summary example above, with the exception that this is using
|
||||||
</para>
|
HBase as a MapReduce source but HDFS as the sink. The differences are in the job setup and
|
||||||
<programlisting>
|
in the reducer. The mapper remains the same. </para>
|
||||||
|
<programlisting>
|
||||||
Configuration config = HBaseConfiguration.create();
|
Configuration config = HBaseConfiguration.create();
|
||||||
Job job = new Job(config,"ExampleSummaryToFile");
|
Job job = new Job(config,"ExampleSummaryToFile");
|
||||||
job.setJarByClass(MySummaryFileJob.class); // class that contains mapper and reducer
|
job.setJarByClass(MySummaryFileJob.class); // class that contains mapper and reducer
|
||||||
|
@@ -965,9 +1126,10 @@ if (!b) {
|
||||||
throw new IOException("error with job!");
|
throw new IOException("error with job!");
|
||||||
}
|
}
|
||||||
</programlisting>
|
</programlisting>
|
||||||
<para>As stated above, the previous Mapper can run unchanged with this example.
|
<para>As stated above, the previous Mapper can run unchanged with this example. As for the
|
||||||
As for the Reducer, it is a "generic" Reducer instead of extending TableMapper and emitting Puts.</para>
|
Reducer, it is a "generic" Reducer instead of extending TableMapper and emitting
|
||||||
<programlisting>
|
Puts.</para>
|
||||||
|
<programlisting>
|
||||||
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
|
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
|
||||||
|
|
||||||
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
|
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
|
||||||
|
@@ -979,33 +1141,35 @@ if (!b) {
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
</programlisting>
|
</programlisting>
|
||||||
</section>
|
</section>
|
||||||
<section xml:id="mapreduce.example.summary.noreducer">
|
<section
|
||||||
<title>HBase MapReduce Summary to HBase Without Reducer</title>
|
xml:id="mapreduce.example.summary.noreducer">
|
||||||
<para>It is also possible to perform summaries without a reducer - if you use HBase as the reducer.
|
<title>HBase MapReduce Summary to HBase Without Reducer</title>
|
||||||
</para>
|
<para>It is also possible to perform summaries without a reducer - if you use HBase as the
|
||||||
<para>An HBase target table would need to exist for the job summary. The HTable method <code>incrementColumnValue</code>
|
reducer. </para>
|
||||||
would be used to atomically increment values. From a performance perspective, it might make sense to keep a Map
|
<para>An HBase target table would need to exist for the job summary. The HTable method
|
||||||
of values with their values to be incremeneted for each map-task, and make one update per key at during the <code>
|
<code>incrementColumnValue</code> would be used to atomically increment values. From a
|
||||||
cleanup</code> method of the mapper. However, your milage may vary depending on the number of rows to be processed and
|
performance perspective, it might make sense to keep a Map of keys with the values to
|
||||||
unique keys.
|
be incremented for each map-task, and make one update per key during the <code>
|
||||||
</para>
|
cleanup</code> method of the mapper. However, your mileage may vary depending on the
|
||||||
<para>In the end, the summary results are in HBase.
|
number of rows to be processed and unique keys. </para>
|
||||||
</para>
|
<para>In the end, the summary results are in HBase. </para>
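<para>A rough sketch of the buffering pattern described above (the table name, column
family, and qualifiers are assumptions for illustration):</para>
<programlisting>
public static class MySummaryMapper extends TableMapper<NullWritable, NullWritable> {
  public static final byte[] CF = "cf".getBytes();
  public static final byte[] ATTR1 = "attr1".getBytes();
  public static final byte[] COUNT = "count".getBytes();

  private final Map<String, Long> counts = new HashMap<String, Long>();
  private HTable summaryTable;

  protected void setup(Context context) throws IOException {
    summaryTable = new HTable(context.getConfiguration(), "summary"); // assumed target table
  }

  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    String key = Bytes.toString(value.getValue(CF, ATTR1));
    Long current = counts.get(key);
    counts.put(key, current == null ? 1L : current + 1L);  // buffer counts in memory
  }

  protected void cleanup(Context context) throws IOException {
    // One atomic increment per distinct key instead of one per input row.
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      summaryTable.incrementColumnValue(Bytes.toBytes(e.getKey()), CF, COUNT, e.getValue());
    }
    summaryTable.close();
  }
}
</programlisting>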
|
||||||
</section>
|
</section>
|
||||||
<section xml:id="mapreduce.example.summary.rdbms">
|
<section
|
||||||
<title>HBase MapReduce Summary to RDBMS</title>
|
xml:id="mapreduce.example.summary.rdbms">
|
||||||
<para>Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases, it is possible
|
<title>HBase MapReduce Summary to RDBMS</title>
|
||||||
to generate summaries directly to an RDBMS via a custom reducer. The <code>setup</code> method
|
<para>Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases,
|
||||||
can connect to an RDBMS (the connection information can be passed via custom parameters in the context) and the
|
it is possible to generate summaries directly to an RDBMS via a custom reducer. The
|
||||||
cleanup method can close the connection.
|
<code>setup</code> method can connect to an RDBMS (the connection information can be
|
||||||
</para>
|
passed via custom parameters in the context) and the cleanup method can close the
|
||||||
<para>It is critical to understand that number of reducers for the job affects the summarization implementation, and
|
connection. </para>
|
||||||
you'll have to design this into your reducer. Specifically, whether it is designed to run as a singleton (one reducer)
|
<para>It is critical to understand that number of reducers for the job affects the
|
||||||
or multiple reducers. Neither is right or wrong, it depends on your use-case. Recognize that the more reducers that
|
summarization implementation, and you'll have to design this into your reducer.
|
||||||
are assigned to the job, the more simultaneous connections to the RDBMS will be created - this will scale, but only to a point.
|
Specifically, whether it is designed to run as a singleton (one reducer) or multiple
|
||||||
</para>
|
reducers. Neither is right or wrong, it depends on your use-case. Recognize that the more
|
||||||
<programlisting>
|
reducers that are assigned to the job, the more simultaneous connections to the RDBMS will
|
||||||
|
be created - this will scale, but only to a point. </para>
|
||||||
|
<programlisting>
|
||||||
public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
|
public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
|
||||||
|
|
||||||
private Connection c = null;
|
private Connection c = null;
|
||||||
|
@@ -1025,18 +1189,18 @@ if (!b) {
|
||||||
|
|
||||||
}
|
}
|
||||||
</programlisting>
|
</programlisting>
|
||||||
<para>In the end, the summary results are written to your RDBMS table/s.
|
<para>In the end, the summary results are written to your RDBMS table/s. </para>
|
||||||
</para>
|
</section>
|
||||||
</section>
|
|
||||||
|
|
||||||
</section> <!-- mr examples -->
|
</section>
|
||||||
<section xml:id="mapreduce.htable.access">
|
<!-- mr examples -->
|
||||||
<title>Accessing Other HBase Tables in a MapReduce Job</title>
|
<section
|
||||||
<para>Although the framework currently allows one HBase table as input to a
|
xml:id="mapreduce.htable.access">
|
||||||
MapReduce job, other HBase tables can
|
<title>Accessing Other HBase Tables in a MapReduce Job</title>
|
||||||
be accessed as lookup tables, etc., in a
|
<para>Although the framework currently allows one HBase table as input to a MapReduce job,
|
||||||
MapReduce job via creating an HTable instance in the setup method of the Mapper.
|
other HBase tables can be accessed as lookup tables, etc., in a MapReduce job via creating
|
||||||
<programlisting>public class MyMapper extends TableMapper<Text, LongWritable> {
|
an HTable instance in the setup method of the Mapper.
|
||||||
|
<programlisting>public class MyMapper extends TableMapper<Text, LongWritable> {
|
||||||
private HTable myOtherTable;
|
private HTable myOtherTable;
|
||||||
|
|
||||||
public void setup(Context context) {
|
public void setup(Context context) {
|
||||||
|
@@ -1049,20 +1213,19 @@ if (!b) {
|
||||||
}
|
}
|
||||||
|
|
||||||
</programlisting>
|
</programlisting>
|
||||||
</para>
|
</para>
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="mapreduce.specex">
|
||||||
|
<title>Speculative Execution</title>
|
||||||
|
<para>It is generally advisable to turn off speculative execution for MapReduce jobs that use
|
||||||
|
HBase as a source. This can either be done on a per-Job basis through properties, or on the
|
||||||
|
entire cluster. Especially for longer running jobs, speculative execution will create
|
||||||
|
duplicate map-tasks which will double-write your data to HBase; this is probably not what
|
||||||
|
you want. </para>
|
||||||
|
<para>See <xref
|
||||||
|
linkend="spec.ex" /> for more information. </para>
|
||||||
</section>
|
</section>
|
||||||
<section xml:id="mapreduce.specex">
|
|
||||||
<title>Speculative Execution</title>
|
|
||||||
<para>It is generally advisable to turn off speculative execution for
|
|
||||||
MapReduce jobs that use HBase as a source. This can either be done on a
|
|
||||||
per-Job basis through properties, on on the entire cluster. Especially
|
|
||||||
for longer running jobs, speculative execution will create duplicate
|
|
||||||
map-tasks which will double-write your data to HBase; this is probably
|
|
||||||
not what you want.
|
|
||||||
</para>
|
|
||||||
<para>See <xref linkend="spec.ex"/> for more information.
|
|
||||||
</para>
|
|
||||||
</section>
|
|
||||||
</chapter> <!-- mapreduce -->
|
</chapter> <!-- mapreduce -->
|
||||||
|
|
||||||
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="security.xml" />