HBASE-8807 HBase MapReduce Job-Launch Documentation Misplaced (Misty Stanley-Jones)
parent d9739b9e3f
commit ef995efb1a
@@ -20,104 +20,7 @@
|
||||||
Provides HBase <a href="http://wiki.apache.org/hadoop/HadoopMapReduce">MapReduce</a>
|
Provides HBase <a href="http://wiki.apache.org/hadoop/HadoopMapReduce">MapReduce</a>
|
||||||
Input/OutputFormats, a table indexing MapReduce job, and utility
|
Input/OutputFormats, a table indexing MapReduce job, and utility
|
||||||
|
|
||||||
<h2>Table of Contents</h2>
|
<p>See <a href="http://hbase.apache.org/book.html#mapreduce">HBase and MapReduce</a>
|
||||||
<ul>
|
in the HBase Reference Guide for mapreduce over hbase documentation.
|
||||||
<li><a href="#classpath">HBase, MapReduce and the CLASSPATH</a></li>
|
|
||||||
<li><a href="#sink">HBase as MapReduce job data source and sink</a></li>
|
|
||||||
<li><a href="#examples">Example Code</a></li>
|
|
||||||
</ul>
|
|
||||||
|
|
||||||
<h2><a name="classpath">HBase, MapReduce and the CLASSPATH</a></h2>
|
|
||||||
|
|
||||||
<p>MapReduce jobs deployed to a MapReduce cluster do not by default have access
|
|
||||||
to the HBase configuration under <code>$HBASE_CONF_DIR</code> nor to HBase classes.
|
|
||||||
You could add <code>hbase-site.xml</code> to $HADOOP_HOME/conf and add
|
|
||||||
<code>hbase-X.X.X.jar</code> to the <code>$HADOOP_HOME/lib</code> and copy these
|
|
||||||
changes across your cluster but the cleanest means of adding hbase configuration
|
|
||||||
and classes to the cluster <code>CLASSPATH</code> is by uncommenting
|
|
||||||
<code>HADOOP_CLASSPATH</code> in <code>$HADOOP_HOME/conf/hadoop-env.sh</code>
|
|
||||||
adding hbase dependencies here. For example, here is how you would amend
|
|
||||||
<code>hadoop-env.sh</code> adding the
|
|
||||||
built hbase jar, zookeeper (needed by hbase client), hbase conf, and the
|
|
||||||
<code>PerformanceEvaluation</code> class from the built hbase test jar to the
|
|
||||||
hadoop <code>CLASSPATH</code>:
|
|
||||||
|
|
||||||
<blockquote><pre># Extra Java CLASSPATH elements. Optional.
|
|
||||||
# export HADOOP_CLASSPATH=
|
|
||||||
export HADOOP_CLASSPATH=$HBASE_HOME/build/hbase-X.X.X.jar:$HBASE_HOME/build/hbase-X.X.X-test.jar:$HBASE_HOME/conf:${HBASE_HOME}/lib/zookeeper-X.X.X.jar</pre></blockquote>
|
|
||||||
|
|
||||||
<p>Expand <code>$HBASE_HOME</code> in the above appropriately to suit your
|
|
||||||
local environment.</p>
|
|
||||||
|
|
||||||
<p>After copying the above change around your cluster (and restarting), this is
|
|
||||||
how you would run the PerformanceEvaluation MR job to put up 4 clients (Presumes
|
|
||||||
a ready mapreduce cluster):
|
|
||||||
|
|
||||||
<blockquote><pre>$HADOOP_HOME/bin/hadoop org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 4</pre></blockquote>
|
|
||||||
|
|
||||||
The PerformanceEvaluation class will be found on the CLASSPATH because you
|
|
||||||
added <code>$HBASE_HOME/build/test</code> to HADOOP_CLASSPATH
|
|
||||||
</p>
|
|
||||||
|
|
||||||
<p>Another possibility, if for example you do not have access to hadoop-env.sh or
|
|
||||||
are unable to restart the hadoop cluster, is bundling the hbase jar into a mapreduce
|
|
||||||
job jar adding it and its dependencies under the job jar <code>lib/</code>
|
|
||||||
directory and the hbase conf into a job jar <code>conf/</code> directory.
|
|
||||||
</p>
|
|
||||||
|
|
||||||
<h2><a name="sink">HBase as MapReduce job data source and sink</a></h2>
|
|
||||||
|
|
||||||
<p>HBase can be used as a data source, {@link org.apache.hadoop.hbase.mapred.TableInputFormat TableInputFormat},
|
|
||||||
and data sink, {@link org.apache.hadoop.hbase.mapred.TableOutputFormat TableOutputFormat}, for MapReduce jobs.
|
|
||||||
Writing MapReduce jobs that read or write HBase, you'll probably want to subclass
|
|
||||||
{@link org.apache.hadoop.hbase.mapred.TableMap TableMap} and/or
|
|
||||||
{@link org.apache.hadoop.hbase.mapred.TableReduce TableReduce}. See the do-nothing
|
|
||||||
pass-through classes {@link org.apache.hadoop.hbase.mapred.IdentityTableMap IdentityTableMap} and
|
|
||||||
{@link org.apache.hadoop.hbase.mapred.IdentityTableReduce IdentityTableReduce} for basic usage. For a more
|
|
||||||
involved example, see <code>BuildTableIndex</code>
|
|
||||||
or review the <code>org.apache.hadoop.hbase.mapred.TestTableMapReduce</code> unit test.
|
|
||||||
</p>
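<p>For orientation only, a minimal driver for this older
<code>org.apache.hadoop.hbase.mapred</code> API might look like the following sketch. The
table name <code>mytable</code>, the column family <code>info</code>, and the driver class
name are placeholders, and <code>NullOutputFormat</code> is used because the identity map
emits nothing that needs persisting.</p>
<blockquote><pre>
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapred.IdentityTableMap;
import org.apache.hadoop.hbase.mapred.TableMapReduceUtil;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

public class MyScanJob {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(HBaseConfiguration.create(), MyScanJob.class);
    job.setJobName("scan-mytable");
    // Scan the 'info' family of 'mytable'; IdentityTableMap passes each row straight through.
    TableMapReduceUtil.initTableMapJob("mytable", "info",
        IdentityTableMap.class, ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);                     // map-only job
    job.setOutputFormat(NullOutputFormat.class);  // nothing is written back out
    JobClient.runJob(job);
  }
}
</pre></blockquote>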
|
|
||||||
|
|
||||||
<p>Running mapreduce jobs that have hbase as source or sink, you'll need to
|
|
||||||
specify source/sink table and column names in your configuration.</p>
|
|
||||||
|
|
||||||
<p>Reading from hbase, the TableInputFormat asks hbase for the list of
|
|
||||||
regions and makes a map-per-region or <code>mapred.map.tasks</code> maps,
|
|
||||||
whichever is smaller (If your job only has two maps, up mapred.map.tasks
|
|
||||||
to a number > number of regions). Maps will run on the adjacent TaskTracker
|
|
||||||
if you are running a TaskTracker and RegionServer per node.
|
|
||||||
Writing, it may make sense to avoid the reduce step and write yourself back into
|
|
||||||
hbase from inside your map. You'd do this when your job does not need the sort
|
|
||||||
and collation that mapreduce does on the map emitted data; on insert,
|
|
||||||
hbase 'sorts' so there is no point double-sorting (and shuffling data around
|
|
||||||
your mapreduce cluster) unless you need to. If you do not need the reduce,
|
|
||||||
you might just have your map emit counts of records processed just so the
|
|
||||||
framework's report at the end of your job has meaning or set the number of
|
|
||||||
reduces to zero and use TableOutputFormat. See example code
|
|
||||||
below. If running the reduce step makes sense in your case, it's usually better
|
|
||||||
to have lots of reducers so load is spread across the hbase cluster.</p>
|
|
||||||
|
|
||||||
<p>There is also a new hbase partitioner that will run as many reducers as
|
|
||||||
currently existing regions. The
|
|
||||||
{@link org.apache.hadoop.hbase.mapred.HRegionPartitioner} is suitable
|
|
||||||
when your table is large and your upload is not such that it will greatly
|
|
||||||
alter the number of existing regions when done; otherwise use the default
|
|
||||||
partitioner.
|
|
||||||
</p>
|
|
||||||
|
|
||||||
<h2><a name="examples">Example Code</a></h2>
|
|
||||||
<h3>Sample Row Counter</h3>
|
|
||||||
<p>See {@link org.apache.hadoop.hbase.mapred.RowCounter}. You should be able to run
|
|
||||||
it by doing: <code>% ./bin/hadoop jar hbase-X.X.X.jar</code>. This will invoke
|
|
||||||
the hbase MapReduce Driver class. Select 'rowcounter' from the choice of jobs
|
|
||||||
offered. You may need to add the hbase conf directory to <code>$HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH</code>
|
|
||||||
so the rowcounter gets pointed at the right hbase cluster (or, build a new jar
|
|
||||||
with an appropriate hbase-site.xml built into your job jar).
|
|
||||||
</p>
|
|
||||||
<h3>PerformanceEvaluation</h3>
|
|
||||||
<p>See org.apache.hadoop.hbase.PerformanceEvaluation from hbase src/test. It runs
|
|
||||||
a mapreduce job to run concurrent clients reading and writing hbase.
|
|
||||||
</p>
|
|
||||||
|
|
||||||
*/
|
*/
|
||||||
package org.apache.hadoop.hbase.mapred;
|
package org.apache.hadoop.hbase.mapred;
|
||||||
|
|
|
@@ -20,144 +20,7 @@
|
||||||
Provides HBase <a href="http://wiki.apache.org/hadoop/HadoopMapReduce">MapReduce</a>
|
Provides HBase <a href="http://wiki.apache.org/hadoop/HadoopMapReduce">MapReduce</a>
|
||||||
Input/OutputFormats, a table indexing MapReduce job, and utility
|
Input/OutputFormats, a table indexing MapReduce job, and utility
|
||||||
|
|
||||||
<h2>Table of Contents</h2>
|
<p>See <a href="http://hbase.apache.org/book.html#mapreduce">HBase and MapReduce</a>
|
||||||
<ul>
|
in the HBase Reference Guide for mapreduce over hbase documentation.
|
||||||
<li><a href="#classpath">HBase, MapReduce and the CLASSPATH</a></li>
|
|
||||||
<li><a href="#driver">Bundled HBase MapReduce Jobs</a></li>
|
|
||||||
<li><a href="#sink">HBase as MapReduce job data source and sink</a></li>
|
|
||||||
<li><a href="#bulk">Bulk Import writing HFiles directly</a></li>
|
|
||||||
<li><a href="#examples">Example Code</a></li>
|
|
||||||
</ul>
|
|
||||||
|
|
||||||
<h2><a name="classpath">HBase, MapReduce and the CLASSPATH</a></h2>
|
|
||||||
|
|
||||||
<p>MapReduce jobs deployed to a MapReduce cluster do not by default have access
|
|
||||||
to the HBase configuration under <code>$HBASE_CONF_DIR</code> nor to HBase classes.
|
|
||||||
You could add <code>hbase-site.xml</code> to
|
|
||||||
<code>$HADOOP_HOME/conf</code> and add
|
|
||||||
HBase jars to the <code>$HADOOP_HOME/lib</code> and copy these
|
|
||||||
changes across your cluster (or edit conf/hadoop-env.sh and add them to the
|
|
||||||
<code>HADOOP_CLASSPATH</code> variable) but this will pollute your
|
|
||||||
hadoop install with HBase references; it's also obnoxious, requiring a restart of
|
|
||||||
the hadoop cluster before it'll notice your HBase additions.</p>
|
|
||||||
|
|
||||||
<p>As of 0.90.x, HBase will just add its dependency jars to the job
|
|
||||||
configuration; the dependencies just need to be available on the local
|
|
||||||
<code>CLASSPATH</code>. For example, to run the bundled HBase
|
|
||||||
{@link org.apache.hadoop.hbase.mapreduce.RowCounter} mapreduce job against a table named <code>usertable</code>,
|
|
||||||
type:
|
|
||||||
|
|
||||||
<blockquote><pre>
|
|
||||||
$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0.jar rowcounter usertable
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
Expand <code>$HBASE_HOME</code> and <code>$HADOOP_HOME</code> in the above
|
|
||||||
appropriately to suit your local environment. The content of <code>HADOOP_CLASSPATH</code>
|
|
||||||
is set to the HBase <code>CLASSPATH</code> via backticking the command
|
|
||||||
<code>${HBASE_HOME}/bin/hbase classpath</code>.
|
|
||||||
|
|
||||||
<p>When the above runs, internally, the HBase jar finds its zookeeper and
|
|
||||||
<a href="http://code.google.com/p/guava-libraries/">guava</a>,
|
|
||||||
etc., dependencies on the passed
|
|
||||||
<code>HADOOP_CLASSPATH</code> and adds the found jars to the mapreduce
|
|
||||||
job configuration. See the source at
|
|
||||||
<code>TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job)</code>
|
|
||||||
for how this is done.
|
|
||||||
</p>
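<p>As a rough illustration (the driver class name is a placeholder), the same shipping of
dependency jars can be requested explicitly from your own driver; the
<code>initTable*Job</code> helpers normally call it for you:</p>
<blockquote><pre>
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

Job job = new Job(HBaseConfiguration.create(), "my-hbase-job");
job.setJarByClass(MyDriver.class);   // placeholder driver class
// Adds the hbase, zookeeper, guava, etc. jars found on the local CLASSPATH
// to the job configuration so cluster-side tasks can see them.
TableMapReduceUtil.addDependencyJars(job);
</pre></blockquote>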
|
|
||||||
<p>The above may not work if you are running your HBase from its build directory;
|
|
||||||
i.e. you've done <code>$ mvn test install</code> at
|
|
||||||
<code>${HBASE_HOME}</code> and you are now
|
|
||||||
trying to use this build in your mapreduce job. If you get
|
|
||||||
<blockquote><pre>java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper
|
|
||||||
...
|
|
||||||
</pre></blockquote>
|
|
||||||
exception thrown, try doing the following:
|
|
||||||
<blockquote><pre>
|
|
||||||
$ HADOOP_CLASSPATH=${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar rowcounter usertable
|
|
||||||
</pre></blockquote>
|
|
||||||
Notice how we preface the backtick invocation setting
|
|
||||||
<code>HADOOP_CLASSPATH</code> with reference to the built HBase jar over in
|
|
||||||
the <code>target</code> directory.
|
|
||||||
</p>
|
|
||||||
|
|
||||||
<h2><a name="driver">Bundled HBase MapReduce Jobs</a></h2>
|
|
||||||
<p>The HBase jar also serves as a Driver for some bundled mapreduce jobs. To
|
|
||||||
learn about the bundled mapreduce jobs run:
|
|
||||||
<blockquote><pre>
|
|
||||||
$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar
|
|
||||||
An example program must be given as the first argument.
|
|
||||||
Valid program names are:
|
|
||||||
copytable: Export a table from local cluster to peer cluster
|
|
||||||
completebulkload: Complete a bulk data load.
|
|
||||||
export: Write table data to HDFS.
|
|
||||||
import: Import data written by Export.
|
|
||||||
importtsv: Import data in TSV format.
|
|
||||||
rowcounter: Count rows in HBase table
|
|
||||||
</pre></blockquote>
|
|
||||||
|
|
||||||
<h2><a name="sink">HBase as MapReduce job data source and sink</a></h2>
|
|
||||||
|
|
||||||
<p>HBase can be used as a data source, {@link org.apache.hadoop.hbase.mapreduce.TableInputFormat TableInputFormat},
|
|
||||||
and data sink, {@link org.apache.hadoop.hbase.mapreduce.TableOutputFormat TableOutputFormat}
|
|
||||||
or {@link org.apache.hadoop.hbase.mapreduce.MultiTableOutputFormat MultiTableOutputFormat},
|
|
||||||
for MapReduce jobs.
|
|
||||||
Writing MapReduce jobs that read or write HBase, you'll probably want to subclass
|
|
||||||
{@link org.apache.hadoop.hbase.mapreduce.TableMapper TableMapper} and/or
|
|
||||||
{@link org.apache.hadoop.hbase.mapreduce.TableReducer TableReducer}. See the do-nothing
|
|
||||||
pass-through classes {@link org.apache.hadoop.hbase.mapreduce.IdentityTableMapper IdentityTableMapper} and
|
|
||||||
{@link org.apache.hadoop.hbase.mapreduce.IdentityTableReducer IdentityTableReducer} for basic usage. For a more
|
|
||||||
involved example, see {@link org.apache.hadoop.hbase.mapreduce.RowCounter}
|
|
||||||
or review the <code>org.apache.hadoop.hbase.mapreduce.TestTableMapReduce</code> unit test.
|
|
||||||
</p>
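<p>A minimal wiring sketch, assuming a table named <code>mytable</code> and a placeholder
driver class, that maps over the table with the do-nothing mapper:</p>
<blockquote><pre>
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableMapper;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

Job job = new Job(HBaseConfiguration.create(), "identity-scan");
job.setJarByClass(MyDriver.class);   // placeholder driver class
Scan scan = new Scan();
scan.setCaching(500);        // a larger scanner cache helps MapReduce scans
scan.setCacheBlocks(false);  // don't fill the block cache from a full-table scan
TableMapReduceUtil.initTableMapperJob("mytable", scan,
    IdentityTableMapper.class, ImmutableBytesWritable.class, Result.class, job);
job.setNumReduceTasks(0);
job.setOutputFormatClass(NullOutputFormat.class);
job.waitForCompletion(true);
</pre></blockquote>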
|
|
||||||
|
|
||||||
<p>Running mapreduce jobs that have HBase as source or sink, you'll need to
|
|
||||||
specify source/sink table and column names in your configuration.</p>
|
|
||||||
|
|
||||||
<p>Reading from HBase, the TableInputFormat asks HBase for the list of
|
|
||||||
regions and makes a map-per-region or <code>mapreduce.job.maps</code> maps,
|
|
||||||
whichever is smaller (If your job only has two maps, up mapreduce.job.maps
|
|
||||||
to a number > number of regions). Maps will run on the adjacent TaskTracker
|
|
||||||
if you are running a TaskTracker and RegionServer per node.
|
|
||||||
Writing, it may make sense to avoid the reduce step and write yourself back into
|
|
||||||
HBase from inside your map. You'd do this when your job does not need the sort
|
|
||||||
and collation that mapreduce does on the map emitted data; on insert,
|
|
||||||
HBase 'sorts' so there is no point double-sorting (and shuffling data around
|
|
||||||
your mapreduce cluster) unless you need to. If you do not need the reduce,
|
|
||||||
you might just have your map emit counts of records processed just so the
|
|
||||||
framework's report at the end of your job has meaning or set the number of
|
|
||||||
reduces to zero and use TableOutputFormat. See example code
|
|
||||||
below. If running the reduce step makes sense in your case, it's usually better
|
|
||||||
to have lots of reducers so load is spread across the HBase cluster.</p>
|
|
||||||
|
|
||||||
<p>There is also a new HBase partitioner that will run as many reducers as
|
|
||||||
currently existing regions. The
|
|
||||||
{@link org.apache.hadoop.hbase.mapreduce.HRegionPartitioner} is suitable
|
|
||||||
when your table is large and your upload is not such that it will greatly
|
|
||||||
alter the number of existing regions when done; otherwise use the default
|
|
||||||
partitioner.
|
|
||||||
</p>
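<p>A rough sketch of selecting this partitioner through the reducer-job helper
(the table and reducer names are placeholders):</p>
<blockquote><pre>
import org.apache.hadoop.hbase.mapreduce.HRegionPartitioner;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;

// Partition reducer input by target region so each reducer writes to one region.
TableMapReduceUtil.initTableReducerJob("targettable", MyTableReducer.class, job,
    HRegionPartitioner.class);
</pre></blockquote>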
|
|
||||||
|
|
||||||
<h2><a name="bulk">Bulk import writing HFiles directly</a></h2>
|
|
||||||
<p>If importing into a new table, it's possible to bypass the HBase API
|
|
||||||
and write your content directly to the filesystem properly formatted as
|
|
||||||
HBase data files (HFiles). Your import will run faster, perhaps an order of
|
|
||||||
magnitude faster if not more. For more on how this mechanism works, see
|
|
||||||
<a href="http://hbase.apache.org/bulk-loads.html">Bulk Loads</code>
|
|
||||||
documentation.
|
|
||||||
</p>
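<p>A rough sketch of such a job, assuming the target table already exists and that the
driver, mapper, and paths are placeholders; the generated files are then handed to the
<code>completebulkload</code> tool:</p>
<blockquote><pre>
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Job job = new Job(HBaseConfiguration.create(), "bulk-import");
job.setJarByClass(MyBulkImportDriver.class);  // placeholder driver class
// ... configure a mapper that emits (ImmutableBytesWritable, Put) pairs ...
HTable table = new HTable(job.getConfiguration(), "targettable");
// Sets up total ordering and the reducer so output HFiles line up with the table's regions.
HFileOutputFormat.configureIncrementalLoad(job, table);
FileOutputFormat.setOutputPath(job, new Path("/tmp/bulk-output"));
job.waitForCompletion(true);
</pre></blockquote>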
|
|
||||||
|
|
||||||
<h2><a name="examples">Example Code</a></h2>
|
|
||||||
<h3>Sample Row Counter</h3>
|
|
||||||
<p>See {@link org.apache.hadoop.hbase.mapreduce.RowCounter}. This job uses
|
|
||||||
{@link org.apache.hadoop.hbase.mapreduce.TableInputFormat TableInputFormat} and
|
|
||||||
does a count of all rows in specified table.
|
|
||||||
You should be able to run
|
|
||||||
it by doing: <code>% ./bin/hadoop jar hbase-X.X.X.jar</code>. This will invoke
|
|
||||||
the hbase MapReduce Driver class. Select 'rowcounter' from the choice of jobs
|
|
||||||
offered. This will emit rowcounter 'usage'. Specify the tablename, column to count
|
|
||||||
and output directory. You may need to add the hbase conf directory to <code>$HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH</code>
|
|
||||||
so the rowcounter gets pointed at the right hbase cluster (or, build a new jar
|
|
||||||
with an appropriate hbase-site.xml built into your job jar).
|
|
||||||
</p>
|
|
||||||
*/
|
*/
|
||||||
package org.apache.hadoop.hbase.mapreduce;
|
package org.apache.hadoop.hbase.mapreduce;
|
||||||
|
|
|
@@ -664,18 +664,76 @@ htable.put(put);
|
||||||
<!-- schema design -->
|
<!-- schema design -->
|
||||||
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="schema_design.xml"/>
|
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="schema_design.xml"/>
|
||||||
|
|
||||||
<chapter xml:id="mapreduce">
|
<chapter
|
||||||
<title>HBase and MapReduce</title>
|
xml:id="mapreduce">
|
||||||
<para>See <link xlink:href="http://hbase.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description">
|
<title>HBase and MapReduce</title>
|
||||||
HBase and MapReduce</link> up in javadocs.
|
<para>Apache MapReduce is a software framework used to analyze large amounts of data, and is
|
||||||
Start there. Below is some additional help.</para>
|
the framework used most often with <link
|
||||||
<para>For more information about MapReduce (i.e., the framework in general), see the Hadoop site (TODO: Need good links here --
|
xlink:href="http://hadoop.apache.org/">Apache Hadoop</link>. MapReduce itself is out of the
|
||||||
we used to have some but they rotted against apache hadoop).</para>
|
scope of this document. A good place to get started with MapReduce is <link
|
||||||
<caution>
|
xlink:href="http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html" />. MapReduce version
|
||||||
<title>Notice to Mapreduce users of HBase 0.96.1 and above</title>
|
2 (MR2) is now part of <link
|
||||||
<para>Some mapreduce jobs that use HBase fail to launch. The symptom is an
|
xlink:href="http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/">YARN</link>. </para>
|
||||||
exception similar to the following:
|
|
||||||
<programlisting>
|
<para> This chapter discusses specific configuration steps you need to take to use MapReduce on
|
||||||
|
data within HBase. In addition, it discusses other interactions and issues between HBase and
|
||||||
|
MapReduce jobs.
|
||||||
|
<note>
|
||||||
|
<title>mapred and mapreduce</title>
|
||||||
|
<para>There are two mapreduce packages in HBase as in MapReduce itself: <filename>org.apache.hadoop.hbase.mapred</filename>
|
||||||
|
and <filename>org.apache.hadoop.hbase.mapreduce</filename>. The former does old-style API and the latter
|
||||||
|
the new style. The latter has more facility though you can usually find an equivalent in the older
|
||||||
|
package. Pick the package that goes with your mapreduce deploy. When in doubt or starting over, pick the
|
||||||
|
<filename>org.apache.hadoop.hbase.mapreduce</filename>. In the notes below, we refer to
|
||||||
|
o.a.h.h.mapreduce but replace with the o.a.h.h.mapred if that is what you are using.
|
||||||
|
</para>
|
||||||
|
</note>
|
||||||
|
</para>
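<para>For instance, the equivalent mapper and reducer roles live in different packages
depending on which API generation you target (an illustrative fragment, not a complete
program):</para>
<programlisting>
// Old-style ("mapred") API, layered on org.apache.hadoop.mapred
import org.apache.hadoop.hbase.mapred.TableMap;
import org.apache.hadoop.hbase.mapred.TableReduce;

// New-style ("mapreduce") API, layered on org.apache.hadoop.mapreduce
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
</programlisting>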
|
||||||
|
|
||||||
|
<section
|
||||||
|
xml:id="hbase.mapreduce.classpath">
|
||||||
|
<title>HBase, MapReduce, and the CLASSPATH</title>
|
||||||
|
<para>By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either
|
||||||
|
the HBase configuration under <envar>$HBASE_CONF_DIR</envar> or the HBase classes.</para>
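<para>For example, client code that builds its own <code>Configuration</code> only sees
<filename>hbase-site.xml</filename> if that file is on its classpath or is added
explicitly; the path below is an assumption for illustration.</para>
<programlisting>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Loads hbase-default.xml plus any hbase-site.xml found on the CLASSPATH.
Configuration conf = HBaseConfiguration.create();
// If hbase-site.xml is not on the CLASSPATH, add it explicitly (path is illustrative).
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));
</programlisting>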
|
||||||
|
<para>To give the MapReduce jobs the access they need, you could add
<filename>hbase-site.xml</filename> to the
<filename><replaceable>$HADOOP_HOME</replaceable>/conf/</filename> directory and add the
HBase JARs to the <filename><replaceable>$HADOOP_HOME</replaceable>/lib/</filename>
directory, then copy these changes across your cluster, or you could edit
<filename><replaceable>$HADOOP_HOME</replaceable>/conf/hadoop-env.sh</filename> and add
them to the <envar>HADOOP_CLASSPATH</envar> variable. However, this approach is not
recommended because it will pollute your Hadoop install with HBase references. It also
requires you to restart the Hadoop cluster before Hadoop can use the HBase data.</para>
|
||||||
|
<para> Since HBase 0.90.x, HBase adds its dependency JARs to the job configuration itself. The
|
||||||
|
dependencies only need to be available on the local CLASSPATH. The following example runs
|
||||||
|
the bundled HBase <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
|
||||||
|
MapReduce job against a table named <systemitem>usertable</systemitem>. If you have not set
|
||||||
|
the environment variables expected in the command (the parts prefixed by a
|
||||||
|
<literal>$</literal> sign and curly braces), you can use the actual system paths instead.
|
||||||
|
Be sure to use the correct version of the HBase JAR for your system. The backticks
|
||||||
|
(<literal>`</literal> symbols) cause the shell to execute the sub-commands, setting the
|
||||||
|
CLASSPATH as part of the command. This example assumes you use a BASH-compatible shell. </para>
|
||||||
|
<screen>$ <userinput>HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0.jar rowcounter usertable</userinput></screen>
|
||||||
|
<para>When the command runs, internally, the HBase JAR finds the dependencies it needs
(ZooKeeper, Guava, and so on) on the passed <envar>HADOOP_CLASSPATH</envar>
|
||||||
|
and adds the JARs to the MapReduce job configuration. See the source at
|
||||||
|
TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done. </para>
|
||||||
|
<note>
|
||||||
|
<para> The example may not work if you are running HBase from its build directory rather
|
||||||
|
than an installed location. You may see an error like the following:</para>
|
||||||
|
<screen>java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper</screen>
|
||||||
|
<para>If this occurs, try modifying the command as follows, so that it uses the HBase JARs
|
||||||
|
from the <filename>target/</filename> directory within the build environment.</para>
|
||||||
|
<screen>$ <userinput>HADOOP_CLASSPATH=${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar rowcounter usertable</userinput></screen>
|
||||||
|
</note>
|
||||||
|
<caution>
|
||||||
|
<title>Notice to Mapreduce users of HBase 0.96.1 and above</title>
|
||||||
|
<para>Some mapreduce jobs that use HBase fail to launch. The symptom is an exception similar
|
||||||
|
to the following:</para>
|
||||||
|
<screen>
|
||||||
Exception in thread "main" java.lang.IllegalAccessError: class
|
Exception in thread "main" java.lang.IllegalAccessError: class
|
||||||
com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass
|
com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass
|
||||||
com.google.protobuf.LiteralByteString
|
com.google.protobuf.LiteralByteString
|
||||||
|
@@ -703,63 +761,158 @@ Exception in thread "main" java.lang.IllegalAccessError: class
|
||||||
at
|
at
|
||||||
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100)
|
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100)
|
||||||
...
|
...
|
||||||
</programlisting>
|
</screen>
|
||||||
This is because of an optimization introduced in <link
|
<para>This is caused by an optimization introduced in <link
|
||||||
xlink:href="https://issues.apache.org/jira/browse/HBASE-9867">HBASE-9867</link>
|
xlink:href="https://issues.apache.org/jira/browse/HBASE-9867">HBASE-9867</link> that
|
||||||
that inadvertently introduced a classloader dependency.
|
inadvertently introduced a classloader dependency. </para>
|
||||||
</para>
|
<para>This affects both jobs using the <code>-libjars</code> option and "fat jar," those
|
||||||
<para>This affects both jobs using the <code>-libjars</code> option and
|
which package their runtime dependencies in a nested <code>lib</code> folder.</para>
|
||||||
"fat jar," those which package their runtime dependencies in a nested
|
<para>In order to satisfy the new classloader requirements, hbase-protocol.jar must be
|
||||||
<code>lib</code> folder.</para>
|
included in Hadoop's classpath. See <xref
|
||||||
<para>In order to satisfy the new classloader requirements,
|
linkend="hbase.mapreduce.classpath" /> for current recommendations for resolving
|
||||||
hbase-protocol.jar must be included in Hadoop's classpath. This can be
|
classpath errors. The following is included for historical purposes.</para>
|
||||||
resolved system-wide by including a reference to the hbase-protocol.jar in
|
<para>This can be resolved system-wide by including a reference to the hbase-protocol.jar in
|
||||||
hadoop's lib directory, via a symlink or by copying the jar into the new
|
hadoop's lib directory, via a symlink or by copying the jar into the new location.</para>
|
||||||
location.</para>
|
<para>This can also be achieved on a per-job launch basis by including it in the
|
||||||
<para>This can also be achieved on a per-job launch basis by including it
|
<code>HADOOP_CLASSPATH</code> environment variable at job submission time. When
|
||||||
in the <code>HADOOP_CLASSPATH</code> environment variable at job submission
|
launching jobs that package their dependencies, all three of the following job launching
|
||||||
time. When launching jobs that package their dependencies, all three of the
|
commands satisfy this requirement:</para>
|
||||||
following job launching commands satisfy this requirement:</para>
|
<screen>
|
||||||
<programlisting>
|
$ <userinput>HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass</userinput>
|
||||||
$ HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
|
$ <userinput>HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass</userinput>
|
||||||
$ HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
|
$ <userinput>HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass</userinput>
|
||||||
$ HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass
|
</screen>
|
||||||
</programlisting>
|
<para>For jars that do not package their dependencies, the following command structure is
|
||||||
<para>For jars that do not package their dependencies, the following command
|
necessary:</para>
|
||||||
structure is necessary:</para>
|
<screen>
|
||||||
<programlisting>
|
$ <userinput>HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',')</userinput> ...
|
||||||
$ HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',') ...
|
</screen>
|
||||||
</programlisting>
|
<para>See also <link
|
||||||
<para>See also <link
|
xlink:href="https://issues.apache.org/jira/browse/HBASE-10304">HBASE-10304</link> for
|
||||||
xlink:href="https://issues.apache.org/jira/browse/HBASE-10304">HBASE-10304</link>
|
further discussion of this issue.</para>
|
||||||
for further discussion of this issue.</para>
|
</caution>
|
||||||
</caution>
|
|
||||||
<section xml:id="splitter">
|
|
||||||
<title>Map-Task Splitting</title>
|
|
||||||
<section xml:id="splitter.default">
|
|
||||||
<title>The Default HBase MapReduce Splitter</title>
|
|
||||||
<para>When <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>
|
|
||||||
is used to source an HBase table in a MapReduce job,
|
|
||||||
its splitter will make a map task for each region of the table.
|
|
||||||
Thus, if there are 100 regions in the table, there will be
|
|
||||||
100 map-tasks for the job - regardless of how many column families are selected in the Scan.</para>
|
|
||||||
</section>
|
</section>
|
||||||
<section xml:id="splitter.custom">
|
|
||||||
<title>Custom Splitters</title>
|
|
||||||
<para>For those interested in implementing custom splitters, see the method <code>getSplits</code> in
|
<section>
|
||||||
<link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.html">TableInputFormatBase</link>.
|
<title>Bundled HBase MapReduce Jobs</title>
|
||||||
That is where the logic for map-task assignment resides.
|
<para>The HBase JAR also serves as a Driver for some bundled mapreduce jobs. To learn about
|
||||||
</para>
|
the bundled MapReduce jobs, run the following command.</para>
|
||||||
|
|
||||||
|
<screen>$ <userinput>${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar</userinput>
|
||||||
|
<computeroutput>An example program must be given as the first argument.
|
||||||
|
Valid program names are:
|
||||||
|
copytable: Export a table from local cluster to peer cluster
|
||||||
|
completebulkload: Complete a bulk data load.
|
||||||
|
export: Write table data to HDFS.
|
||||||
|
import: Import data written by Export.
|
||||||
|
importtsv: Import data in TSV format.
|
||||||
|
rowcounter: Count rows in HBase table</computeroutput>
|
||||||
|
</screen>
|
||||||
|
<para>Each of the valid program names are bundled MapReduce jobs. To run one of the jobs,
|
||||||
|
model your command after the following example.</para>
|
||||||
|
<screen>$ <userinput>${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar rowcounter myTable</userinput></screen>
|
||||||
</section>
|
</section>
|
||||||
</section>
|
|
||||||
<section xml:id="mapreduce.example">
|
<section>
|
||||||
<title>HBase MapReduce Examples</title>
|
<title>HBase as a MapReduce Job Data Source and Data Sink</title>
|
||||||
<section xml:id="mapreduce.example.read">
|
<para>HBase can be used as a data source, <link
|
||||||
<title>HBase MapReduce Read Example</title>
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>,
|
||||||
<para>The following is an example of using HBase as a MapReduce source in read-only manner. Specifically,
|
and data sink, <link
|
||||||
there is a Mapper instance but no Reducer, and nothing is being emitted from the Mapper. There job would be defined
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>
|
||||||
as follows...
|
or <link
|
||||||
<programlisting>
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.html">MultiTableOutputFormat</link>,
|
||||||
|
for MapReduce jobs. Writing MapReduce jobs that read or write HBase, it is advisable to
|
||||||
|
subclass <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>
|
||||||
|
and/or <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableReducer.html">TableReducer</link>.
|
||||||
|
See the do-nothing pass-through classes <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableMapper.html">IdentityTableMapper</link>
|
||||||
|
and <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableReducer.html">IdentityTableReducer</link>
|
||||||
|
for basic usage. For a more involved example, see <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
|
||||||
|
or review the <code>org.apache.hadoop.hbase.mapreduce.TestTableMapReduce</code> unit test. </para>
|
||||||
|
<para>If you run MapReduce jobs that use HBase as source or sink, you need to specify source and
|
||||||
|
sink table and column names in your configuration.</para>
|
||||||
|
|
||||||
|
<para>When you read from HBase, the <code>TableInputFormat</code> requests the list of regions
|
||||||
|
from HBase and makes one map task per region, or <code>mapreduce.job.maps</code> map
tasks, whichever is smaller. If your job only has two maps,
|
||||||
|
raise <code>mapreduce.job.maps</code> to a number greater than the number of regions. Maps
|
||||||
|
will run on the adjacent TaskTracker if you are running a TaskTracker and RegionServer per
|
||||||
|
node. When writing to HBase, it may make sense to avoid the Reduce step and write back into
|
||||||
|
HBase from within your map. This approach works when your job does not need the sort and
|
||||||
|
collation that MapReduce does on the map-emitted data. On insert, HBase 'sorts' so there is
|
||||||
|
no point double-sorting (and shuffling data around your MapReduce cluster) unless you need
|
||||||
|
to. If you do not need the Reduce, your map might emit counts of records processed for
reporting at the end of the job, or set the number of Reduces to zero and use
|
||||||
|
TableOutputFormat. If running the Reduce step makes sense in your case, you should typically
|
||||||
|
use multiple reducers so that load is spread across the HBase cluster.</para>
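<para>A rough sketch of the zero-reducer pattern just described (table, mapper, and driver
names are placeholders); it mirrors the read/write example later in this chapter.</para>
<programlisting>
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

Job job = new Job(HBaseConfiguration.create(), "map-only-write");
job.setJarByClass(MyMapOnlyJob.class);   // placeholder driver class
TableMapReduceUtil.initTableMapperJob("sourcetable", new Scan(),
    MyMapper.class, ImmutableBytesWritable.class, Put.class, job);
// Passing null as the reducer class configures TableOutputFormat for the target table.
TableMapReduceUtil.initTableReducerJob("targettable", null, job);
job.setNumReduceTasks(0);   // the mapper's Puts go straight to 'targettable'
job.waitForCompletion(true);
</programlisting>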
|
||||||
|
|
||||||
|
<para>A new HBase partitioner, the <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/HRegionPartitioner.html">HRegionPartitioner</link>,
|
||||||
|
can run as many reducers as there are existing regions. The HRegionPartitioner is suitable
|
||||||
|
when your table is large and your upload will not greatly alter the number of existing
|
||||||
|
regions upon completion. Otherwise use the default partitioner. </para>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>Writing HFiles Directly During Bulk Import</title>
|
||||||
|
<para>If you are importing into a new table, you can bypass the HBase API and write your
|
||||||
|
content directly to the filesystem, formatted into HBase data files (HFiles). Your import
|
||||||
|
will run faster, perhaps an order of magnitude faster. For more on how this mechanism works,
|
||||||
|
see <xref
|
||||||
|
linkend="arch.bulk.load" />.</para>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>RowCounter Example</title>
|
||||||
|
<para>The included <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
|
||||||
|
MapReduce job uses <code>TableInputFormat</code> and does a count of all rows in the specified
|
||||||
|
table. To run it, use the following command: </para>
|
||||||
|
<screen>$ <userinput>./bin/hadoop jar hbase-X.X.X.jar</userinput></screen>
|
||||||
|
<para>This will
|
||||||
|
invoke the HBase MapReduce Driver class. Select <literal>rowcounter</literal> from the choice of jobs
|
||||||
|
offered. This will print rowcounter usage advice to standard output. Specify the tablename,
|
||||||
|
column to count, and output
|
||||||
|
directory. If you have classpath errors, see <xref linkend="hbase.mapreduce.classpath" />.</para>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section
|
||||||
|
xml:id="splitter">
|
||||||
|
<title>Map-Task Splitting</title>
|
||||||
|
<section
|
||||||
|
xml:id="splitter.default">
|
||||||
|
<title>The Default HBase MapReduce Splitter</title>
|
||||||
|
<para>When <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>
|
||||||
|
is used to source an HBase table in a MapReduce job, its splitter will make a map task for
|
||||||
|
each region of the table. Thus, if there are 100 regions in the table, there will be 100
|
||||||
|
map-tasks for the job - regardless of how many column families are selected in the
|
||||||
|
Scan.</para>
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="splitter.custom">
|
||||||
|
<title>Custom Splitters</title>
|
||||||
|
<para>For those interested in implementing custom splitters, see the method
|
||||||
|
<code>getSplits</code> in <link
|
||||||
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.html">TableInputFormatBase</link>.
|
||||||
|
That is where the logic for map-task assignment resides. </para>
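<para>A rough sketch of a custom splitter that post-processes the default
one-split-per-region list (the class name and selection logic are illustrative only):</para>
<programlisting>
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

public class SampledTableInputFormat extends TableInputFormat {
  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    List<InputSplit> regionSplits = super.getSplits(context); // one split per region
    List<InputSplit> keep = new ArrayList<InputSplit>();
    for (int i = 0; i < regionSplits.size(); i += 2) {
      keep.add(regionSplits.get(i));  // illustrative: sample every other region
    }
    return keep;
  }
}
</programlisting>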
|
||||||
|
</section>
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="mapreduce.example">
|
||||||
|
<title>HBase MapReduce Examples</title>
|
||||||
|
<section
|
||||||
|
xml:id="mapreduce.example.read">
|
||||||
|
<title>HBase MapReduce Read Example</title>
|
||||||
|
<para>The following is an example of using HBase as a MapReduce source in a read-only manner.
|
||||||
|
Specifically, there is a Mapper instance but no Reducer, and nothing is being emitted from
|
||||||
|
the Mapper. The job would be defined as follows...</para>
|
||||||
|
<programlisting>
|
||||||
Configuration config = HBaseConfiguration.create();
|
Configuration config = HBaseConfiguration.create();
|
||||||
Job job = new Job(config, "ExampleRead");
|
Job job = new Job(config, "ExampleRead");
|
||||||
job.setJarByClass(MyReadJob.class); // class that contains mapper
|
job.setJarByClass(MyReadJob.class); // class that contains mapper
|
||||||
|
@@ -784,8 +937,9 @@ if (!b) {
|
||||||
throw new IOException("error with job!");
|
throw new IOException("error with job!");
|
||||||
}
|
}
|
||||||
</programlisting>
|
</programlisting>
|
||||||
...and the mapper instance would extend <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>...
|
<para>...and the mapper instance would extend <link
|
||||||
<programlisting>
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>...</para>
|
||||||
|
<programlisting>
|
||||||
public static class MyMapper extends TableMapper<Text, Text> {
|
public static class MyMapper extends TableMapper<Text, Text> {
|
||||||
|
|
||||||
public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
|
public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
|
||||||
|
@@ -793,13 +947,13 @@ public static class MyMapper extends TableMapper<Text, Text> {
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
</programlisting>
|
</programlisting>
|
||||||
</para>
|
</section>
|
||||||
</section>
|
<section
|
||||||
<section xml:id="mapreduce.example.readwrite">
|
xml:id="mapreduce.example.readwrite">
|
||||||
<title>HBase MapReduce Read/Write Example</title>
|
<title>HBase MapReduce Read/Write Example</title>
|
||||||
<para>The following is an example of using HBase both as a source and as a sink with MapReduce.
|
<para>The following is an example of using HBase both as a source and as a sink with
|
||||||
This example will simply copy data from one table to another.</para>
|
MapReduce. This example will simply copy data from one table to another.</para>
|
||||||
<programlisting>
|
<programlisting>
|
||||||
Configuration config = HBaseConfiguration.create();
|
Configuration config = HBaseConfiguration.create();
|
||||||
Job job = new Job(config,"ExampleReadWrite");
|
Job job = new Job(config,"ExampleReadWrite");
|
||||||
job.setJarByClass(MyReadWriteJob.class); // class that contains mapper
|
job.setJarByClass(MyReadWriteJob.class); // class that contains mapper
|
||||||
|
@@ -827,15 +981,18 @@ if (!b) {
|
||||||
throw new IOException("error with job!");
|
throw new IOException("error with job!");
|
||||||
}
|
}
|
||||||
</programlisting>
|
</programlisting>
|
||||||
<para>An explanation is required of what <classname>TableMapReduceUtil</classname> is doing, especially with the reducer.
|
<para>An explanation is required of what <classname>TableMapReduceUtil</classname> is doing,
|
||||||
<link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link> is being used
|
especially with the reducer. <link
|
||||||
as the outputFormat class, and several parameters are being set on the config (e.g., TableOutputFormat.OUTPUT_TABLE), as
|
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>
|
||||||
well as setting the reducer output key to <classname>ImmutableBytesWritable</classname> and reducer value to <classname>Writable</classname>.
|
is being used as the outputFormat class, and several parameters are being set on the
|
||||||
These could be set by the programmer on the job and conf, but <classname>TableMapReduceUtil</classname> tries to make things easier.</para>
|
config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key
|
||||||
<para>The following is the example mapper, which will create a <classname>Put</classname> and matching the input <classname>Result</classname>
|
to <classname>ImmutableBytesWritable</classname> and reducer value to
|
||||||
and emit it. Note: this is what the CopyTable utility does.
|
<classname>Writable</classname>. These could be set by the programmer on the job and
|
||||||
</para>
|
conf, but <classname>TableMapReduceUtil</classname> tries to make things easier.</para>
|
||||||
<programlisting>
|
<para>The following is the example mapper, which will create a <classname>Put</classname>
|
||||||
|
and matching the input <classname>Result</classname> and emit it. Note: this is what the
|
||||||
|
CopyTable utility does. </para>
|
||||||
|
<programlisting>
|
||||||
public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {
|
public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {
|
||||||
|
|
||||||
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
|
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
|
||||||
|
@@ -852,23 +1009,24 @@ public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put&
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
</programlisting>
|
</programlisting>
|
||||||
<para>There isn't actually a reducer step, so <classname>TableOutputFormat</classname> takes care of sending the <classname>Put</classname>
|
<para>There isn't actually a reducer step, so <classname>TableOutputFormat</classname> takes
|
||||||
to the target table.
|
care of sending the <classname>Put</classname> to the target table. </para>
|
||||||
</para>
|
<para>This is just an example, developers could choose not to use
|
||||||
<para>This is just an example, developers could choose not to use <classname>TableOutputFormat</classname> and connect to the
|
<classname>TableOutputFormat</classname> and connect to the target table themselves.
|
||||||
target table themselves.
|
</para>
|
||||||
</para>
|
</section>
|
||||||
</section>
|
<section
|
||||||
<section xml:id="mapreduce.example.readwrite.multi">
|
xml:id="mapreduce.example.readwrite.multi">
|
||||||
<title>HBase MapReduce Read/Write Example With Multi-Table Output</title>
|
<title>HBase MapReduce Read/Write Example With Multi-Table Output</title>
|
||||||
<para>TODO: example for <classname>MultiTableOutputFormat</classname>.
|
<para>TODO: example for <classname>MultiTableOutputFormat</classname>. </para>
|
||||||
</para>
|
</section>
|
||||||
</section>
|
<section
|
||||||
<section xml:id="mapreduce.example.summary">
|
xml:id="mapreduce.example.summary">
|
||||||
<title>HBase MapReduce Summary to HBase Example</title>
|
<title>HBase MapReduce Summary to HBase Example</title>
|
||||||
<para>The following example uses HBase as a MapReduce source and sink with a summarization step. This example will
|
<para>The following example uses HBase as a MapReduce source and sink with a summarization
|
||||||
count the number of distinct instances of a value in a table and write those summarized counts in another table.
|
step. This example will count the number of distinct instances of a value in a table and
|
||||||
<programlisting>
|
write those summarized counts in another table.
|
||||||
|
<programlisting>
|
||||||
Configuration config = HBaseConfiguration.create();
|
Configuration config = HBaseConfiguration.create();
|
||||||
Job job = new Job(config,"ExampleSummary");
|
Job job = new Job(config,"ExampleSummary");
|
||||||
job.setJarByClass(MySummaryJob.class); // class that contains mapper and reducer
|
job.setJarByClass(MySummaryJob.class); // class that contains mapper and reducer
|
||||||
|
@@ -896,9 +1054,10 @@ if (!b) {
|
||||||
throw new IOException("error with job!");
|
throw new IOException("error with job!");
|
||||||
}
|
}
|
||||||
</programlisting>
|
</programlisting>
|
||||||
In this example mapper a column with a String-value is chosen as the value to summarize upon.
|
In this example mapper a column with a String-value is chosen as the value to summarize
|
||||||
This value is used as the key to emit from the mapper, and an <classname>IntWritable</classname> represents an instance counter.
|
upon. This value is used as the key to emit from the mapper, and an
|
||||||
<programlisting>
|
<classname>IntWritable</classname> represents an instance counter.
|
||||||
|
<programlisting>
|
||||||
public static class MyMapper extends TableMapper<Text, IntWritable> {
|
public static class MyMapper extends TableMapper<Text, IntWritable> {
|
||||||
public static final byte[] CF = "cf".getBytes();
|
public static final byte[] CF = "cf".getBytes();
|
||||||
public static final byte[] ATTR1 = "attr1".getBytes();
|
public static final byte[] ATTR1 = "attr1".getBytes();
|
||||||
|
@@ -914,8 +1073,9 @@ public static class MyMapper extends TableMapper<Text, IntWritable> {
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
</programlisting>
|
</programlisting>
|
||||||
In the reducer, the "ones" are counted (just like any other MR example that does this), and then emits a <classname>Put</classname>.
|
In the reducer, the "ones" are counted (just like any other MR example that does this),
|
||||||
<programlisting>
|
and then emits a <classname>Put</classname>.
|
||||||
|
<programlisting>
|
||||||
public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
|
public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
|
||||||
public static final byte[] CF = "cf".getBytes();
|
public static final byte[] CF = "cf".getBytes();
|
||||||
public static final byte[] COUNT = "count".getBytes();
|
public static final byte[] COUNT = "count".getBytes();
|
||||||
|
@@ -932,14 +1092,15 @@ public static class MyTableReducer extends TableReducer<Text, IntWritable, Im
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
</programlisting>
|
</programlisting>
|
||||||
</para>
|
</para>
|
||||||
</section>
|
</section>
|
||||||
<section xml:id="mapreduce.example.summary.file">
|
<section
|
||||||
<title>HBase MapReduce Summary to File Example</title>
|
xml:id="mapreduce.example.summary.file">
|
||||||
<para>This very similar to the summary example above, with exception that this is using HBase as a MapReduce source
|
<title>HBase MapReduce Summary to File Example</title>
|
||||||
but HDFS as the sink. The differences are in the job setup and in the reducer. The mapper remains the same.
|
<para>This is very similar to the summary example above, with the exception that this is using
|
||||||
</para>
|
HBase as a MapReduce source but HDFS as the sink. The differences are in the job setup and
|
||||||
<programlisting>
|
in the reducer. The mapper remains the same. </para>
|
||||||
|
<programlisting>
|
||||||
Configuration config = HBaseConfiguration.create();
|
Configuration config = HBaseConfiguration.create();
|
||||||
Job job = new Job(config,"ExampleSummaryToFile");
|
Job job = new Job(config,"ExampleSummaryToFile");
|
||||||
job.setJarByClass(MySummaryFileJob.class); // class that contains mapper and reducer
|
job.setJarByClass(MySummaryFileJob.class); // class that contains mapper and reducer
|
||||||
|
@@ -965,9 +1126,10 @@ if (!b) {
|
||||||
throw new IOException("error with job!");
|
throw new IOException("error with job!");
|
||||||
}
|
}
|
||||||
</programlisting>
|
</programlisting>
|
||||||
<para>As stated above, the previous Mapper can run unchanged with this example.
|
<para>As stated above, the previous Mapper can run unchanged with this example. As for the
|
||||||
As for the Reducer, it is a "generic" Reducer instead of extending TableMapper and emitting Puts.</para>
|
Reducer, it is a "generic" Reducer instead of extending TableMapper and emitting
|
||||||
<programlisting>
|
Puts.</para>
|
||||||
|
<programlisting>
|
||||||
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
|
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
|
||||||
|
|
||||||
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
|
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
|
||||||
|
@@ -979,33 +1141,35 @@ if (!b) {
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
</programlisting>
|
</programlisting>
|
||||||
</section>
|
</section>
|
||||||
<section xml:id="mapreduce.example.summary.noreducer">
|
<section
|
||||||
<title>HBase MapReduce Summary to HBase Without Reducer</title>
|
xml:id="mapreduce.example.summary.noreducer">
|
||||||
<para>It is also possible to perform summaries without a reducer - if you use HBase as the reducer.
|
<title>HBase MapReduce Summary to HBase Without Reducer</title>
|
||||||
</para>
|
<para>It is also possible to perform summaries without a reducer - if you use HBase as the
|
||||||
<para>An HBase target table would need to exist for the job summary. The HTable method <code>incrementColumnValue</code>
|
reducer. </para>
|
||||||
would be used to atomically increment values. From a performance perspective, it might make sense to keep a Map
|
<para>An HBase target table would need to exist for the job summary. The HTable method
|
||||||
of values with their values to be incremeneted for each map-task, and make one update per key at during the <code>
|
<code>incrementColumnValue</code> would be used to atomically increment values. From a
|
||||||
cleanup</code> method of the mapper. However, your milage may vary depending on the number of rows to be processed and
|
performance perspective, it might make sense to keep a Map of keys with the values to
|
||||||
unique keys.
|
be incremented for each map-task, and make one update per key during the <code>
|
||||||
</para>
|
cleanup</code> method of the mapper. However, your mileage may vary depending on the
|
||||||
<para>In the end, the summary results are in HBase.
|
number of rows to be processed and unique keys. </para>
|
||||||
</para>
|
<para>In the end, the summary results are in HBase. </para>
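<para>A rough sketch of the buffering pattern described above (the table name, column
family, and qualifiers are assumptions for illustration):</para>
<programlisting>
public static class MySummaryMapper extends TableMapper<NullWritable, NullWritable> {
  public static final byte[] CF = "cf".getBytes();
  public static final byte[] ATTR1 = "attr1".getBytes();
  public static final byte[] COUNT = "count".getBytes();

  private final Map<String, Long> counts = new HashMap<String, Long>();
  private HTable summaryTable;

  protected void setup(Context context) throws IOException {
    summaryTable = new HTable(context.getConfiguration(), "summary"); // assumed target table
  }

  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    String key = Bytes.toString(value.getValue(CF, ATTR1));
    Long current = counts.get(key);
    counts.put(key, current == null ? 1L : current + 1L);  // buffer counts in memory
  }

  protected void cleanup(Context context) throws IOException {
    // One atomic increment per distinct key instead of one per input row.
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      summaryTable.incrementColumnValue(Bytes.toBytes(e.getKey()), CF, COUNT, e.getValue());
    }
    summaryTable.close();
  }
}
</programlisting>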
|
||||||
</section>
|
</section>
|
||||||
<section xml:id="mapreduce.example.summary.rdbms">
|
<section
|
||||||
<title>HBase MapReduce Summary to RDBMS</title>
|
xml:id="mapreduce.example.summary.rdbms">
|
||||||
<para>Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases, it is possible
|
<title>HBase MapReduce Summary to RDBMS</title>
|
||||||
to generate summaries directly to an RDBMS via a custom reducer. The <code>setup</code> method
|
<para>Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases,
|
||||||
can connect to an RDBMS (the connection information can be passed via custom parameters in the context) and the
|
it is possible to generate summaries directly to an RDBMS via a custom reducer. The
|
||||||
cleanup method can close the connection.
|
<code>setup</code> method can connect to an RDBMS (the connection information can be
|
||||||
</para>
|
passed via custom parameters in the context) and the cleanup method can close the
|
||||||
<para>It is critical to understand that number of reducers for the job affects the summarization implementation, and
|
connection. </para>
|
||||||
you'll have to design this into your reducer. Specifically, whether it is designed to run as a singleton (one reducer)
|
<para>It is critical to understand that number of reducers for the job affects the
|
||||||
or multiple reducers. Neither is right or wrong, it depends on your use-case. Recognize that the more reducers that
|
summarization implementation, and you'll have to design this into your reducer.
|
||||||
are assigned to the job, the more simultaneous connections to the RDBMS will be created - this will scale, but only to a point.
|
Specifically, whether it is designed to run as a singleton (one reducer) or multiple
|
||||||
</para>
|
reducers. Neither is right or wrong, it depends on your use-case. Recognize that the more
|
||||||
<programlisting>
|
reducers that are assigned to the job, the more simultaneous connections to the RDBMS will
|
||||||
|
be created - this will scale, but only to a point. </para>
|
||||||
|
<programlisting>
|
||||||
public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
|
public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
|
||||||
|
|
||||||
private Connection c = null;
|
private Connection c = null;
|
||||||
|
@@ -1025,18 +1189,18 @@ if (!b) {
|
||||||
|
|
||||||
}
|
}
|
||||||
</programlisting>
|
</programlisting>
|
||||||
<para>In the end, the summary results are written to your RDBMS table/s.
|
<para>In the end, the summary results are written to your RDBMS table/s. </para>
|
||||||
</para>
|
</section>
|
||||||
</section>
|
|
||||||
|
|
||||||
</section> <!-- mr examples -->
|
</section>
|
||||||
<section xml:id="mapreduce.htable.access">
|
<!-- mr examples -->
|
||||||
<title>Accessing Other HBase Tables in a MapReduce Job</title>
|
<section
|
||||||
<para>Although the framework currently allows one HBase table as input to a
|
xml:id="mapreduce.htable.access">
|
||||||
MapReduce job, other HBase tables can
|
<title>Accessing Other HBase Tables in a MapReduce Job</title>
|
||||||
be accessed as lookup tables, etc., in a
|
<para>Although the framework currently allows one HBase table as input to a MapReduce job,
|
||||||
MapReduce job via creating an HTable instance in the setup method of the Mapper.
|
other HBase tables can be accessed as lookup tables, etc., in a MapReduce job via creating
|
||||||
<programlisting>public class MyMapper extends TableMapper<Text, LongWritable> {
|
an HTable instance in the setup method of the Mapper.
|
||||||
|
<programlisting>public class MyMapper extends TableMapper<Text, LongWritable> {
|
||||||
private HTable myOtherTable;
|
private HTable myOtherTable;
|
||||||
|
|
||||||
public void setup(Context context) {
|
public void setup(Context context) {
|
||||||
|
@@ -1049,20 +1213,19 @@ if (!b) {
|
||||||
}
|
}
|
||||||
|
|
||||||
</programlisting>
|
</programlisting>
|
||||||
</para>
|
</para>
|
||||||
|
</section>
|
||||||
|
<section
|
||||||
|
xml:id="mapreduce.specex">
|
||||||
|
<title>Speculative Execution</title>
|
||||||
|
<para>It is generally advisable to turn off speculative execution for MapReduce jobs that use
|
||||||
|
HBase as a source. This can either be done on a per-Job basis through properties, or on the
|
||||||
|
entire cluster. Especially for longer running jobs, speculative execution will create
|
||||||
|
duplicate map-tasks which will double-write your data to HBase; this is probably not what
|
||||||
|
you want. </para>
|
||||||
|
<para>See <xref
|
||||||
|
linkend="spec.ex" /> for more information. </para>
|
||||||
</section>
|
</section>
|
||||||
<section xml:id="mapreduce.specex">
|
|
||||||
<title>Speculative Execution</title>
|
|
||||||
<para>It is generally advisable to turn off speculative execution for
|
|
||||||
MapReduce jobs that use HBase as a source. This can either be done on a
|
|
||||||
per-Job basis through properties, on on the entire cluster. Especially
|
|
||||||
for longer running jobs, speculative execution will create duplicate
|
|
||||||
map-tasks which will double-write your data to HBase; this is probably
|
|
||||||
not what you want.
|
|
||||||
</para>
|
|
||||||
<para>See <xref linkend="spec.ex"/> for more information.
|
|
||||||
</para>
|
|
||||||
</section>
|
|
||||||
</chapter> <!-- mapreduce -->
|
</chapter> <!-- mapreduce -->
|
||||||
|
|
||||||
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="security.xml" />