HBASE-1698 Review documentation for o.a.h.h.mapreduce
git-svn-id: https://svn.apache.org/repos/asf/hadoop/hbase/trunk@808144 13f79535-47bb-0310-9956-ffa450edef68
commit 210a157a95
parent de546af646
CHANGES.txt

@@ -8,6 +8,7 @@ Release 0.21.0 - Unreleased
    HBASE-1737 Regions unbalanced when adding new node (recommit)
    HBASE-1792 [Regression] Cannot save timestamp in the future
    HBASE-1793 [Regression] HTable.get/getRow with a ts is broken
+   HBASE-1698 Review documentation for o.a.h.h.mapreduce
 
   IMPROVEMENTS
    HBASE-1760 Cleanup TODOs in HTable
o.a.h.h.mapreduce package documentation

@@ -33,41 +33,34 @@ Input/OutputFormats, a table indexing MapReduce job, and utility
 <p>MapReduce jobs deployed to a MapReduce cluster do not by default have access
 to the HBase configuration under <code>$HBASE_CONF_DIR</code> nor to HBase classes.
 You could add <code>hbase-site.xml</code> to $HADOOP_HOME/conf and add
-<code>hbase-X.X.X.jar</code> to the <code>$HADOOP_HOME/lib</code> and copy these
-changes across your cluster but the cleanest means of adding hbase configuration
+hbase jars to the <code>$HADOOP_HOME/lib</code> and copy these
+changes across your cluster but a cleaner means of adding hbase configuration
 and classes to the cluster <code>CLASSPATH</code> is by uncommenting
 <code>HADOOP_CLASSPATH</code> in <code>$HADOOP_HOME/conf/hadoop-env.sh</code>
-and adding the path to the hbase jar and <code>$HBASE_CONF_DIR</code> directory.
-Then copy the amended configuration around the cluster.
-You'll probably need to restart the MapReduce cluster if you want it to notice
-the new configuration.
-</p>
+adding hbase dependencies here. For example, here is how you would amend
+<code>hadoop-env.sh</code> adding the
+built hbase jar, zookeeper (needed by hbase client), hbase conf, and the
+<code>PerformanceEvaluation</code> class from the built hbase test jar to the
+hadoop <code>CLASSPATH</code>:
 
-<p>For example, here is how you would amend <code>hadoop-env.sh</code> adding the
-built hbase jar, hbase conf, and the <code>PerformanceEvaluation</code> class from
-the built hbase test jar to the hadoop <code>CLASSPATH<code>:
-
 <blockquote><pre># Extra Java CLASSPATH elements. Optional.
 # export HADOOP_CLASSPATH=
-export HADOOP_CLASSPATH=$HBASE_HOME/build/test:$HBASE_HOME/build/hbase-X.X.X.jar:$HBASE_HOME/build/hbase-X.X.X-test.jar:$HBASE_HOME/conf</pre></blockquote>
+export HADOOP_CLASSPATH=$HBASE_HOME/build/hbase-X.X.X.jar:$HBASE_HOME/build/hbase-X.X.X-test.jar:$HBASE_HOME/conf:${HBASE_HOME}/lib/zookeeper-X.X.X.jar</pre></blockquote>
 
 <p>Expand <code>$HBASE_HOME</code> in the above appropriately to suit your
 local environment.</p>
 
-<p>After copying the above change around your cluster, this is how you would run
-the PerformanceEvaluation MR job to put up 4 clients (Presumes a ready mapreduce
-cluster):
+<p>After copying the above change around your cluster (and restarting), this is
+how you would run the PerformanceEvaluation MR job to put up 4 clients (Presumes
+a ready mapreduce cluster):
 
 <blockquote><pre>$HADOOP_HOME/bin/hadoop org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 4</pre></blockquote>
-
-The PerformanceEvaluation class wil be found on the CLASSPATH because you
-added <code>$HBASE_HOME/build/test</code> to HADOOP_CLASSPATH
 </p>
 
 <p>Another possibility, if for example you do not have access to hadoop-env.sh or
-are unable to restart the hadoop cluster, is bundling the hbase jar into a mapreduce
+are unable to restart the hadoop cluster, is bundling the hbase jars into a mapreduce
 job jar adding it and its dependencies under the job jar <code>lib/</code>
-directory and the hbase conf into a job jar <code>conf/</code> directory.
+directory and the hbase conf into the job jar's top-level directory.
 </a>
 
 <h2><a name="sink">HBase as MapReduce job data source and sink</a></h2>
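The reason the conf directory (or a bundled <code>hbase-site.xml</code>) must be on the CLASSPATH is that the HBase client builds its configuration from whatever <code>hbase-default.xml</code> and <code>hbase-site.xml</code> it can see there. Below is a minimal, hedged driver sketch of that behavior; <code>MyDriver</code> and the job name are invented for illustration and are not part of this commit.

<blockquote><pre>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.mapreduce.Job;

public class MyDriver {
  public static void main(String[] args) throws Exception {
    // HBaseConfiguration pulls hbase-default.xml and hbase-site.xml off the
    // CLASSPATH, so the cluster location comes from whichever conf dir or
    // job jar made it onto HADOOP_CLASSPATH as described above.
    Configuration conf = new HBaseConfiguration();
    Job job = new Job(conf, "my-hbase-job");
    job.setJarByClass(MyDriver.class);
    // ...set mapper, reducer, and input/output formats here before submitting...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
</pre></blockquote>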
@@ -79,7 +72,7 @@ Writing MapReduce jobs that read or write HBase, you'll probably want to subclas
 {@link org.apache.hadoop.hbase.mapreduce.TableReducer TableReducer}. See the do-nothing
 pass-through classes {@link org.apache.hadoop.hbase.mapreduce.IdentityTableMapper IdentityTableMapper} and
 {@link org.apache.hadoop.hbase.mapreduce.IdentityTableReducer IdentityTableReducer} for basic usage. For a more
-involved example, see {@link org.apache.hadoop.hbase.mapreduce.BuildTableIndex BuildTableIndex}
+involved example, see {@link org.apache.hadoop.hbase.mapreduce.RowCounter}
 or review the <code>org.apache.hadoop.hbase.mapreduce.TestTableMapReduce</code> unit test.
 </p>
 
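To make the subclassing described in the hunk above concrete, here is a minimal sketch of a map-only table-scanning job wired up with <code>TableMapReduceUtil</code>. It assumes the 0.20-era <code>org.apache.hadoop.mapreduce</code> API and the helper signatures current around this commit; the <code>SampleRowCounter</code> and <code>RowCountMapper</code> names are invented for illustration (the shipped equivalent is RowCounter).

<blockquote><pre>
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

/** Map-only job that scans a table and counts rows; no emissions, no reduce. */
public class SampleRowCounter {

  static class RowCountMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    public void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // One counter increment per row handed to us by TableInputFormat.
      context.getCounter("sample", "rows").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();
    Job job = new Job(conf, "sampleRowCounter");
    job.setJarByClass(SampleRowCounter.class);
    // Wire a full-table Scan in as the job's input and set the mapper.
    TableMapReduceUtil.initTableMapperJob(args[0], new Scan(),
        RowCountMapper.class, ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
</pre></blockquote>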
@@ -106,162 +99,22 @@ to have lots of reducers so load is spread across the hbase cluster.</p>
 currently existing regions. The
 {@link org.apache.hadoop.hbase.mapreduce.HRegionPartitioner} is suitable
 when your table is large and your upload is not such that it will greatly
-alter the number of existing regions when done; other use the default
+alter the number of existing regions when done; otherwise use the default
 partitioner.
 </p>
 
 <h2><a name="examples">Example Code</a></h2>
 <h3>Sample Row Counter</h3>
-<p>See {@link org.apache.hadoop.hbase.mapreduce.RowCounter}. You should be able to run
+<p>See {@link org.apache.hadoop.hbase.mapreduce.RowCounter}. This job uses
+{@link org.apache.hadoop.hbase.mapreduce.TableInputFormat TableInputFormat} and
+does a count of all rows in the specified table.
+You should be able to run
 it by doing: <code>% ./bin/hadoop jar hbase-X.X.X.jar</code>. This will invoke
 the hbase MapReduce Driver class. Select 'rowcounter' from the choice of jobs
-offered. You may need to add the hbase conf directory to <code>$HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH</code>
+offered. This will emit rowcounter 'usage'. Specify tablename, column to count
+and output directory. You may need to add the hbase conf directory to <code>$HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH</code>
 so the rowcounter gets pointed at the right hbase cluster (or, build a new jar
 with an appropriate hbase-site.xml built into your job jar).
 </p>
-<h3>PerformanceEvaluation</h3>
-<p>See org.apache.hadoop.hbase.PerformanceEvaluation from hbase src/test. It runs
-a mapreduce job to run concurrent clients reading and writing hbase.
-</p>
-
-<h3>Sample MR Bulk Uploader</h3>
-<p>A students/classes example based on a contribution by Naama Kraus with logs of
-documentation can be found over in src/examples/mapred.
-Its the <code>org.apache.hadoop.hbase.mapreduce.SampleUploader</code> class.
-Just copy it under src/java/org/apache/hadoop/hbase/mapred to compile and try it
-(until we start generating an hbase examples jar). The class reads a data file
-from HDFS and per line, does an upload to HBase using TableReduce.
-Read the class comment for specification of inputs, prerequisites, etc.
-</p>
-
-<h3>Example to bulk import/load a text file into an HTable
-</h3>
-
-<p>Here's a sample program from
-<a href="http://www.spicylogic.com/allenday/blog/category/computing/distributed-systems/hadoop/hbase/">Allen Day</a>
-that takes an HDFS text file path and an HBase table name as inputs, and loads the contents of the text file to the table
-all up in the map phase.
-</p>
-
-<blockquote><pre>
-package com.spicylogic.hbase;
-package org.apache.hadoop.hbase.mapreduce;
-import java.io.IOException;
-
-import org.apache.hadoop.conf.Configuration;
-import org.apache.hadoop.fs.Path;
-import org.apache.hadoop.hbase.HBaseConfiguration;
-import org.apache.hadoop.hbase.client.HTable;
-import org.apache.hadoop.hbase.io.BatchUpdate;
-import org.apache.hadoop.io.LongWritable;
-import org.apache.hadoop.io.Text;
-import org.apache.hadoop.mapred.FileInputFormat;
-import org.apache.hadoop.mapred.JobClient;
-import org.apache.hadoop.mapred.JobConf;
-import org.apache.hadoop.mapred.MapReduceBase;
-import org.apache.hadoop.mapred.Mapper;
-import org.apache.hadoop.mapred.OutputCollector;
-import org.apache.hadoop.mapred.Reporter;
-import org.apache.hadoop.mapred.lib.NullOutputFormat;
-import org.apache.hadoop.util.Tool;
-import org.apache.hadoop.util.ToolRunner;
-
-/**
- * Class that adds the parsed line from the input to hbase
- * in the map function. Map has no emissions and job
- * has no reduce.
- */
-public class BulkImport implements Tool {
-  private static final String NAME = "BulkImport";
-  private Configuration conf;
-
-  public static class InnerMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
-    private HTable table;
-    private HBaseConfiguration HBconf;
-
-    public void map(LongWritable key, Text value,
-        OutputCollector<Text, Text> output, Reporter reporter)
-        throws IOException {
-      if ( table == null )
-        throw new IOException("table is null");
-
-      // Split input line on tab character
-      String [] splits = value.toString().split("\t");
-      if ( splits.length != 4 )
-        return;
-
-      String rowID = splits[0];
-      int timestamp = Integer.parseInt( splits[1] );
-      String colID = splits[2];
-      String cellValue = splits[3];
-
-      reporter.setStatus("Map emitting cell for row='" + rowID +
-          "', column='" + colID + "', time='" + timestamp + "'");
-
-      BatchUpdate bu = new BatchUpdate( rowID );
-      if ( timestamp > 0 )
-        bu.setTimestamp( timestamp );
-
-      bu.put(colID, cellValue.getBytes());
-      table.commit( bu );
-    }
-
-    public void configure(JobConf job) {
-      HBconf = new HBaseConfiguration(job);
-      try {
-        table = new HTable( HBconf, job.get("input.table") );
-      } catch (IOException e) {
-        // TODO Auto-generated catch block
-        e.printStackTrace();
-      }
-    }
-  }
-
-  public JobConf createSubmittableJob(String[] args) {
-    JobConf c = new JobConf(getConf(), BulkImport.class);
-    c.setJobName(NAME);
-    FileInputFormat.setInputPaths(c, new Path(args[0]));
-
-    c.set("input.table", args[1]);
-    c.setMapperClass(InnerMap.class);
-    c.setNumReduceTasks(0);
-    c.setOutputFormat(NullOutputFormat.class);
-    return c;
-  }
-
-  static int printUsage() {
-    System.err.println("Usage: " + NAME + " <input> <table_name>");
-    System.err.println("\twhere <input> is a tab-delimited text file with 4 columns.");
-    System.err.println("\t\tcolumn 1 = row ID");
-    System.err.println("\t\tcolumn 2 = timestamp (use a negative value for current time)");
-    System.err.println("\t\tcolumn 3 = column ID");
-    System.err.println("\t\tcolumn 4 = cell value");
-    return -1;
-  }
-
-  public int run(@SuppressWarnings("unused") String[] args) throws Exception {
-    // Make sure there are exactly 3 parameters left.
-    if (args.length != 2) {
-      return printUsage();
-    }
-    JobClient.runJob(createSubmittableJob(args));
-    return 0;
-  }
-
-  public Configuration getConf() {
-    return this.conf;
-  }
-
-  public void setConf(final Configuration c) {
-    this.conf = c;
-  }
-
-  public static void main(String[] args) throws Exception {
-    int errCode = ToolRunner.run(new Configuration(), new BulkImport(), args);
-    System.exit(errCode);
-  }
-}
-</pre></blockquote>
-
 */
 package org.apache.hadoop.hbase.mapreduce;
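One more hedged sketch, tied to the <code>HRegionPartitioner</code> note in the last hunk: a small helper showing how a write-heavy job might opt in. The helper name and the idea of sizing the reduce count to the current number of regions are illustrative only, and the partitioner is assumed to require that the job's output table already be configured (for example via <code>TableMapReduceUtil.initTableReducerJob</code>).

<blockquote><pre>
import org.apache.hadoop.hbase.mapreduce.HRegionPartitioner;
import org.apache.hadoop.mapreduce.Job;

public class PartitionerSetup {
  /**
   * Route each reduce output to the region that will host it so that, with
   * roughly one reducer per existing region, writes spread across the cluster.
   * Only worthwhile when the upload will not greatly change the region count;
   * otherwise keep the default partitioner, as the documentation above notes.
   */
  static void useRegionPartitioner(Job job, int currentRegionCount) {
    job.setPartitionerClass(HRegionPartitioner.class);
    job.setNumReduceTasks(currentRegionCount);
  }
}
</pre></blockquote>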