HBASE-1698 Review documentation for o.a.h.h.mapreduce
git-svn-id: https://svn.apache.org/repos/asf/hadoop/hbase/trunk@808144 13f79535-47bb-0310-9956-ffa450edef68
@@ -8,6 +8,7 @@ Release 0.21.0 - Unreleased
    HBASE-1737 Regions unbalanced when adding new node (recommit)
    HBASE-1792 [Regression] Cannot save timestamp in the future
    HBASE-1793 [Regression] HTable.get/getRow with a ts is broken
+   HBASE-1698 Review documentation for o.a.h.h.mapreduce

 IMPROVEMENTS
    HBASE-1760 Cleanup TODOs in HTable
@@ -33,41 +33,34 @@ Input/OutputFormats, a table indexing MapReduce job, and utility
 <p>MapReduce jobs deployed to a MapReduce cluster do not by default have access
 to the HBase configuration under <code>$HBASE_CONF_DIR</code> nor to HBase classes.
 You could add <code>hbase-site.xml</code> to $HADOOP_HOME/conf and add
-<code>hbase-X.X.X.jar</code> to the <code>$HADOOP_HOME/lib</code> and copy these
-changes across your cluster but the cleanest means of adding hbase configuration
+hbase jars to the <code>$HADOOP_HOME/lib</code> and copy these
+changes across your cluster but a cleaner means of adding hbase configuration
 and classes to the cluster <code>CLASSPATH</code> is by uncommenting
 <code>HADOOP_CLASSPATH</code> in <code>$HADOOP_HOME/conf/hadoop-env.sh</code>
-and adding the path to the hbase jar and <code>$HBASE_CONF_DIR</code> directory.
-Then copy the amended configuration around the cluster.
-You'll probably need to restart the MapReduce cluster if you want it to notice
-the new configuration.
-</p>
-
-<p>For example, here is how you would amend <code>hadoop-env.sh</code> adding the
-built hbase jar, hbase conf, and the <code>PerformanceEvaluation</code> class from
-the built hbase test jar to the hadoop <code>CLASSPATH<code>:
+adding hbase dependencies here. For example, here is how you would amend
+<code>hadoop-env.sh</code> adding the
+built hbase jar, zookeeper (needed by hbase client), hbase conf, and the
+<code>PerformanceEvaluation</code> class from the built hbase test jar to the
+hadoop <code>CLASSPATH</code>:

 <blockquote><pre># Extra Java CLASSPATH elements. Optional.
 # export HADOOP_CLASSPATH=
-export HADOOP_CLASSPATH=$HBASE_HOME/build/test:$HBASE_HOME/build/hbase-X.X.X.jar:$HBASE_HOME/build/hbase-X.X.X-test.jar:$HBASE_HOME/conf</pre></blockquote>
+export HADOOP_CLASSPATH=$HBASE_HOME/build/hbase-X.X.X.jar:$HBASE_HOME/build/hbase-X.X.X-test.jar:$HBASE_HOME/conf:${HBASE_HOME}/lib/zookeeper-X.X.X.jar</pre></blockquote>

 <p>Expand <code>$HBASE_HOME</code> in the above appropriately to suit your
 local environment.</p>

-<p>After copying the above change around your cluster, this is how you would run
-the PerformanceEvaluation MR job to put up 4 clients (Presumes a ready mapreduce
-cluster):
+<p>After copying the above change around your cluster (and restarting), this is
+how you would run the PerformanceEvaluation MR job to put up 4 clients (presumes
+a ready mapreduce cluster):

 <blockquote><pre>$HADOOP_HOME/bin/hadoop org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 4</pre></blockquote>
-
-The PerformanceEvaluation class wil be found on the CLASSPATH because you
-added <code>$HBASE_HOME/build/test</code> to HADOOP_CLASSPATH
 </p>

 <p>Another possibility, if for example you do not have access to hadoop-env.sh or
-are unable to restart the hadoop cluster, is bundling the hbase jar into a mapreduce
+are unable to restart the hadoop cluster, is bundling the hbase jars into a mapreduce
 job jar adding it and its dependencies under the job jar <code>lib/</code>
-directory and the hbase conf into a job jar <code>conf/</code> directory.
+directory and the hbase conf into the job jar's top-level directory.
 </p>
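For illustration, a job jar bundled this way might list its contents as below (the job classes and version numbers here are placeholders; the point is that dependency jars sit under <code>lib/</code> while <code>hbase-site.xml</code> sits at the jar's top level, where it is picked up as a classpath resource):

<blockquote><pre>$ jar tf my-hbase-job.jar
META-INF/MANIFEST.MF
hbase-site.xml
com/example/MyJob.class
com/example/MyJob$InnerMapper.class
lib/hbase-X.X.X.jar
lib/zookeeper-X.X.X.jar</pre></blockquote>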

 <h2><a name="sink">HBase as MapReduce job data source and sink</a></h2>
@@ -79,7 +72,7 @@ Writing MapReduce jobs that read or write HBase, you'll probably want to subclas
 {@link org.apache.hadoop.hbase.mapreduce.TableReducer TableReducer}. See the do-nothing
 pass-through classes {@link org.apache.hadoop.hbase.mapreduce.IdentityTableMapper IdentityTableMapper} and
 {@link org.apache.hadoop.hbase.mapreduce.IdentityTableReducer IdentityTableReducer} for basic usage. For a more
-involved example, see {@link org.apache.hadoop.hbase.mapreduce.BuildTableIndex BuildTableIndex}
+involved example, see {@link org.apache.hadoop.hbase.mapreduce.RowCounter}
 or review the <code>org.apache.hadoop.hbase.mapreduce.TestTableMapReduce</code> unit test.
 </p>

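To make the subclassing concrete, here is a rough sketch of a job that copies one column from a source table into a target table; the class, table, family and qualifier names are invented for the example and error handling is left out:

<blockquote><pre>import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

/** Copies colfam:qual from 'sourcetable' into 'targettable'. */
public class CopyColumn {
  static final byte [] FAMILY = Bytes.toBytes("colfam");
  static final byte [] QUALIFIER = Bytes.toBytes("qual");

  /** Emits one Put per input row; the identity reducer writes the Puts out. */
  public static class CopyMapper
  extends TableMapper<ImmutableBytesWritable, Put> {
    public void map(ImmutableBytesWritable row, Result value, Context context)
    throws IOException, InterruptedException {
      byte [] cell = value.getValue(FAMILY, QUALIFIER);
      if (cell == null) return;   // Row has no such cell; skip it.
      Put put = new Put(row.get());
      put.add(FAMILY, QUALIFIER, cell);
      context.write(row, put);
    }
  }

  public static void main(String [] args) throws Exception {
    Job job = new Job(new HBaseConfiguration(), "copycolumn");
    job.setJarByClass(CopyColumn.class);
    Scan scan = new Scan();
    scan.addColumn(FAMILY, QUALIFIER);
    // Read 'sourcetable' with TableInputFormat, feeding CopyMapper.
    TableMapReduceUtil.initTableMapperJob("sourcetable", scan, CopyMapper.class,
      ImmutableBytesWritable.class, Put.class, job);
    // Write the emitted Puts to 'targettable' via TableOutputFormat.
    TableMapReduceUtil.initTableReducerJob("targettable", IdentityTableReducer.class, job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}</pre></blockquote>

TableMapReduceUtil sets the input and output formats and passes the Scan along in the job configuration, so none of that has to be wired up by hand.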
@@ -106,162 +99,22 @@ to have lots of reducers so load is spread across the hbase cluster.</p>
 currently existing regions. The
 {@link org.apache.hadoop.hbase.mapreduce.HRegionPartitioner} is suitable
 when your table is large and your upload is not such that it will greatly
-alter the number of existing regions when done; other use the default
+alter the number of existing regions when done; otherwise use the default
 partitioner.
 </p>
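In job setup the choice comes down to a single call; for example (a sketch only, assuming <code>job</code> has already been configured to write to an hbase table, which HRegionPartitioner needs in order to look up the region start keys):

<blockquote><pre>// Route each emitted row to the reduce serving the region that owns its key;
// leave this line out to stay with Hadoop's default hash partitioner.
job.setPartitionerClass(org.apache.hadoop.hbase.mapreduce.HRegionPartitioner.class);</pre></blockquote>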

 <h2><a name="examples">Example Code</a></h2>
 <h3>Sample Row Counter</h3>
-<p>See {@link org.apache.hadoop.hbase.mapreduce.RowCounter}. You should be able to run
+<p>See {@link org.apache.hadoop.hbase.mapreduce.RowCounter}. This job uses
+{@link org.apache.hadoop.hbase.mapreduce.TableInputFormat TableInputFormat} and
+does a count of all rows in the specified table.
+You should be able to run
 it by doing: <code>% ./bin/hadoop jar hbase-X.X.X.jar</code>. This will invoke
 the hbase MapReduce Driver class. Select 'rowcounter' from the choice of jobs
-offered. You may need to add the hbase conf directory to <code>$HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH</code>
+offered. This will emit rowcounter 'usage'. Specify tablename, column to count
+and output directory. You may need to add the hbase conf directory to <code>$HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH</code>
 so the rowcounter gets pointed at the right hbase cluster (or, build a new jar
 with an appropriate hbase-site.xml built into your job jar).
 </p>
-<h3>PerformanceEvaluation</h3>
-<p>See org.apache.hadoop.hbase.PerformanceEvaluation from hbase src/test. It runs
-a mapreduce job to run concurrent clients reading and writing hbase.
-</p>
-
-<h3>Sample MR Bulk Uploader</h3>
-<p>A students/classes example based on a contribution by Naama Kraus with logs of
-documentation can be found over in src/examples/mapred.
-Its the <code>org.apache.hadoop.hbase.mapreduce.SampleUploader</code> class.
-Just copy it under src/java/org/apache/hadoop/hbase/mapred to compile and try it
-(until we start generating an hbase examples jar). The class reads a data file
-from HDFS and per line, does an upload to HBase using TableReduce.
-Read the class comment for specification of inputs, prerequisites, etc.
-</p>
-
-<h3>Example to bulk import/load a text file into an HTable
-</h3>
-
-<p>Here's a sample program from
-<a href="http://www.spicylogic.com/allenday/blog/category/computing/distributed-systems/hadoop/hbase/">Allen Day</a>
-that takes an HDFS text file path and an HBase table name as inputs, and loads the contents of the text file to the table
-all up in the map phase.
-</p>
-
-<blockquote><pre>
-package com.spicylogic.hbase;
-package org.apache.hadoop.hbase.mapreduce;
-import java.io.IOException;
-
-import org.apache.hadoop.conf.Configuration;
-import org.apache.hadoop.fs.Path;
-import org.apache.hadoop.hbase.HBaseConfiguration;
-import org.apache.hadoop.hbase.client.HTable;
-import org.apache.hadoop.hbase.io.BatchUpdate;
-import org.apache.hadoop.io.LongWritable;
-import org.apache.hadoop.io.Text;
-import org.apache.hadoop.mapred.FileInputFormat;
-import org.apache.hadoop.mapred.JobClient;
-import org.apache.hadoop.mapred.JobConf;
-import org.apache.hadoop.mapred.MapReduceBase;
-import org.apache.hadoop.mapred.Mapper;
-import org.apache.hadoop.mapred.OutputCollector;
-import org.apache.hadoop.mapred.Reporter;
-import org.apache.hadoop.mapred.lib.NullOutputFormat;
-import org.apache.hadoop.util.Tool;
-import org.apache.hadoop.util.ToolRunner;
-
-/**
- * Class that adds the parsed line from the input to hbase
- * in the map function. Map has no emissions and job
- * has no reduce.
- */
-public class BulkImport implements Tool {
-  private static final String NAME = "BulkImport";
-  private Configuration conf;
-
-  public static class InnerMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
-    private HTable table;
-    private HBaseConfiguration HBconf;
-
-    public void map(LongWritable key, Text value,
-      OutputCollector<Text, Text> output, Reporter reporter)
-    throws IOException {
-      if ( table == null )
-        throw new IOException("table is null");
-
-      // Split input line on tab character
-      String [] splits = value.toString().split("\t");
-      if ( splits.length != 4 )
-        return;
-
-      String rowID = splits[0];
-      int timestamp = Integer.parseInt( splits[1] );
-      String colID = splits[2];
-      String cellValue = splits[3];
-
-      reporter.setStatus("Map emitting cell for row='" + rowID +
-        "', column='" + colID + "', time='" + timestamp + "'");
-
-      BatchUpdate bu = new BatchUpdate( rowID );
-      if ( timestamp > 0 )
-        bu.setTimestamp( timestamp );
-
-      bu.put(colID, cellValue.getBytes());
-      table.commit( bu );
-    }
-
-    public void configure(JobConf job) {
-      HBconf = new HBaseConfiguration(job);
-      try {
-        table = new HTable( HBconf, job.get("input.table") );
-      } catch (IOException e) {
-        // TODO Auto-generated catch block
-        e.printStackTrace();
-      }
-    }
-  }
-
-  public JobConf createSubmittableJob(String[] args) {
-    JobConf c = new JobConf(getConf(), BulkImport.class);
-    c.setJobName(NAME);
-    FileInputFormat.setInputPaths(c, new Path(args[0]));
-
-    c.set("input.table", args[1]);
-    c.setMapperClass(InnerMap.class);
-    c.setNumReduceTasks(0);
-    c.setOutputFormat(NullOutputFormat.class);
-    return c;
-  }
-
-  static int printUsage() {
-    System.err.println("Usage: " + NAME + " <input> <table_name>");
-    System.err.println("\twhere <input> is a tab-delimited text file with 4 columns.");
-    System.err.println("\t\tcolumn 1 = row ID");
-    System.err.println("\t\tcolumn 2 = timestamp (use a negative value for current time)");
-    System.err.println("\t\tcolumn 3 = column ID");
-    System.err.println("\t\tcolumn 4 = cell value");
-    return -1;
-  }
-
-  public int run(@SuppressWarnings("unused") String[] args) throws Exception {
-    // Make sure there are exactly 3 parameters left.
-    if (args.length != 2) {
-      return printUsage();
-    }
-    JobClient.runJob(createSubmittableJob(args));
-    return 0;
-  }
-
-  public Configuration getConf() {
-    return this.conf;
-  }
-
-  public void setConf(final Configuration c) {
-    this.conf = c;
-  }
-
-  public static void main(String[] args) throws Exception {
-    int errCode = ToolRunner.run(new Configuration(), new BulkImport(), args);
-    System.exit(errCode);
-  }
-}
-</pre></blockquote>
-
 */
 package org.apache.hadoop.hbase.mapreduce;