HBASE-1698 Review documentation for o.a.h.h.mapreduce
git-svn-id: https://svn.apache.org/repos/asf/hadoop/hbase/trunk@808144 13f79535-47bb-0310-9956-ffa450edef68
commit 210a157a95
parent de546af646
CHANGES.txt

@@ -8,6 +8,7 @@ Release 0.21.0 - Unreleased
    HBASE-1737 Regions unbalanced when adding new node (recommit)
    HBASE-1792 [Regression] Cannot save timestamp in the future
    HBASE-1793 [Regression] HTable.get/getRow with a ts is broken
+   HBASE-1698 Review documentation for o.a.h.h.mapreduce
 
   IMPROVEMENTS
    HBASE-1760 Cleanup TODOs in HTable
o.a.h.h.mapreduce package documentation

@@ -33,41 +33,34 @@ Input/OutputFormats, a table indexing MapReduce job, and utility
 <p>MapReduce jobs deployed to a MapReduce cluster do not by default have access
 to the HBase configuration under <code>$HBASE_CONF_DIR</code> nor to HBase classes.
 You could add <code>hbase-site.xml</code> to $HADOOP_HOME/conf and add
-<code>hbase-X.X.X.jar</code> to the <code>$HADOOP_HOME/lib</code> and copy these
-changes across your cluster but the cleanest means of adding hbase configuration
+hbase jars to the <code>$HADOOP_HOME/lib</code> and copy these
+changes across your cluster but a cleaner means of adding hbase configuration
 and classes to the cluster <code>CLASSPATH</code> is by uncommenting
 <code>HADOOP_CLASSPATH</code> in <code>$HADOOP_HOME/conf/hadoop-env.sh</code>
-and adding the path to the hbase jar and <code>$HBASE_CONF_DIR</code> directory.
-Then copy the amended configuration around the cluster.
-You'll probably need to restart the MapReduce cluster if you want it to notice
-the new configuration.
-</p>
+adding hbase dependencies here. For example, here is how you would amend
+<code>hadoop-env.sh</code> adding the
+built hbase jar, zookeeper (needed by hbase client), hbase conf, and the
+<code>PerformanceEvaluation</code> class from the built hbase test jar to the
+hadoop <code>CLASSPATH</code>:
 
-<p>For example, here is how you would amend <code>hadoop-env.sh</code> adding the
-built hbase jar, hbase conf, and the <code>PerformanceEvaluation</code> class from
-the built hbase test jar to the hadoop <code>CLASSPATH<code>:
-
 <blockquote><pre># Extra Java CLASSPATH elements. Optional.
 # export HADOOP_CLASSPATH=
-export HADOOP_CLASSPATH=$HBASE_HOME/build/test:$HBASE_HOME/build/hbase-X.X.X.jar:$HBASE_HOME/build/hbase-X.X.X-test.jar:$HBASE_HOME/conf</pre></blockquote>
+export HADOOP_CLASSPATH=$HBASE_HOME/build/hbase-X.X.X.jar:$HBASE_HOME/build/hbase-X.X.X-test.jar:$HBASE_HOME/conf:${HBASE_HOME}/lib/zookeeper-X.X.X.jar</pre></blockquote>
 
 <p>Expand <code>$HBASE_HOME</code> in the above appropriately to suit your
 local environment.</p>
 
-<p>After copying the above change around your cluster, this is how you would run
-the PerformanceEvaluation MR job to put up 4 clients (Presumes a ready mapreduce
-cluster):
+<p>After copying the above change around your cluster (and restarting), this is
+how you would run the PerformanceEvaluation MR job to put up 4 clients (Presumes
+a ready mapreduce cluster):
 
 <blockquote><pre>$HADOOP_HOME/bin/hadoop org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 4</pre></blockquote>
-
-The PerformanceEvaluation class wil be found on the CLASSPATH because you
-added <code>$HBASE_HOME/build/test</code> to HADOOP_CLASSPATH
 </p>
 
 <p>Another possibility, if for example you do not have access to hadoop-env.sh or
-are unable to restart the hadoop cluster, is bundling the hbase jar into a mapreduce
+are unable to restart the hadoop cluster, is bundling the hbase jars into a mapreduce
 job jar adding it and its dependencies under the job jar <code>lib/</code>
-directory and the hbase conf into a job jar <code>conf/</code> directory.
+directory and the hbase conf into the job jar's top-level directory.
 </a>
 
 <h2><a name="sink">HBase as MapReduce job data source and sink</a></h2>
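The reason the conf directory (or a bundled <code>hbase-site.xml</code>) must be on the CLASSPATH is that the HBase client builds its configuration from whatever <code>hbase-default.xml</code> and <code>hbase-site.xml</code> it can see there. Below is a minimal, hedged driver sketch of that behavior; <code>MyDriver</code> and the job name are invented for illustration and are not part of this commit.

<blockquote><pre>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.mapreduce.Job;

public class MyDriver {
  public static void main(String[] args) throws Exception {
    // HBaseConfiguration pulls hbase-default.xml and hbase-site.xml off the
    // CLASSPATH, so the cluster location comes from whichever conf dir or
    // job jar made it onto HADOOP_CLASSPATH as described above.
    Configuration conf = new HBaseConfiguration();
    Job job = new Job(conf, "my-hbase-job");
    job.setJarByClass(MyDriver.class);
    // ...set mapper, reducer, and input/output formats here before submitting...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
</pre></blockquote>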
@@ -79,7 +72,7 @@ Writing MapReduce jobs that read or write HBase, you'll probably want to subclas
 {@link org.apache.hadoop.hbase.mapreduce.TableReducer TableReducer}. See the do-nothing
 pass-through classes {@link org.apache.hadoop.hbase.mapreduce.IdentityTableMapper IdentityTableMapper} and
 {@link org.apache.hadoop.hbase.mapreduce.IdentityTableReducer IdentityTableReducer} for basic usage. For a more
-involved example, see {@link org.apache.hadoop.hbase.mapreduce.BuildTableIndex BuildTableIndex}
+involved example, see {@link org.apache.hadoop.hbase.mapreduce.RowCounter}
 or review the <code>org.apache.hadoop.hbase.mapreduce.TestTableMapReduce</code> unit test.
 </p>
 
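To make the subclassing described in the hunk above concrete, here is a minimal sketch of a map-only table-scanning job wired up with <code>TableMapReduceUtil</code>. It assumes the 0.20-era <code>org.apache.hadoop.mapreduce</code> API and the helper signatures current around this commit; the <code>SampleRowCounter</code> and <code>RowCountMapper</code> names are invented for illustration (the shipped equivalent is RowCounter).

<blockquote><pre>
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

/** Map-only job that scans a table and counts rows; no emissions, no reduce. */
public class SampleRowCounter {

  static class RowCountMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    public void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // One counter increment per row handed to us by TableInputFormat.
      context.getCounter("sample", "rows").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();
    Job job = new Job(conf, "sampleRowCounter");
    job.setJarByClass(SampleRowCounter.class);
    // Wire a full-table Scan in as the job's input and set the mapper.
    TableMapReduceUtil.initTableMapperJob(args[0], new Scan(),
        RowCountMapper.class, ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
</pre></blockquote>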
@@ -106,162 +99,22 @@ to have lots of reducers so load is spread across the hbase cluster.</p>
 currently existing regions. The
 {@link org.apache.hadoop.hbase.mapreduce.HRegionPartitioner} is suitable
 when your table is large and your upload is not such that it will greatly
-alter the number of existing regions when done; other use the default
+alter the number of existing regions when done; otherwise use the default
 partitioner.
 </p>
 
 <h2><a name="examples">Example Code</a></h2>
 <h3>Sample Row Counter</h3>
-<p>See {@link org.apache.hadoop.hbase.mapreduce.RowCounter}. You should be able to run
+<p>See {@link org.apache.hadoop.hbase.mapreduce.RowCounter}. This job uses
+{@link org.apache.hadoop.hbase.mapreduce.TableInputFormat TableInputFormat} and
+does a count of all rows in the specified table.
+You should be able to run
 it by doing: <code>% ./bin/hadoop jar hbase-X.X.X.jar</code>. This will invoke
 the hbase MapReduce Driver class. Select 'rowcounter' from the choice of jobs
-offered. You may need to add the hbase conf directory to <code>$HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH</code>
+offered. This will emit rowcounter 'usage'. Specify tablename, column to count
+and output directory. You may need to add the hbase conf directory to <code>$HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH</code>
 so the rowcounter gets pointed at the right hbase cluster (or, build a new jar
 with an appropriate hbase-site.xml built into your job jar).
 </p>
-<h3>PerformanceEvaluation</h3>
-<p>See org.apache.hadoop.hbase.PerformanceEvaluation from hbase src/test. It runs
-a mapreduce job to run concurrent clients reading and writing hbase.
-</p>
-
-<h3>Sample MR Bulk Uploader</h3>
-<p>A students/classes example based on a contribution by Naama Kraus with logs of
-documentation can be found over in src/examples/mapred.
-Its the <code>org.apache.hadoop.hbase.mapreduce.SampleUploader</code> class.
-Just copy it under src/java/org/apache/hadoop/hbase/mapred to compile and try it
-(until we start generating an hbase examples jar). The class reads a data file
-from HDFS and per line, does an upload to HBase using TableReduce.
-Read the class comment for specification of inputs, prerequisites, etc.
-</p>
-
-<h3>Example to bulk import/load a text file into an HTable
-</h3>
-
-<p>Here's a sample program from
-<a href="http://www.spicylogic.com/allenday/blog/category/computing/distributed-systems/hadoop/hbase/">Allen Day</a>
-that takes an HDFS text file path and an HBase table name as inputs, and loads the contents of the text file to the table
-all up in the map phase.
-</p>
-
-<blockquote><pre>
-package com.spicylogic.hbase;
-package org.apache.hadoop.hbase.mapreduce;
-import java.io.IOException;
-
-import org.apache.hadoop.conf.Configuration;
-import org.apache.hadoop.fs.Path;
-import org.apache.hadoop.hbase.HBaseConfiguration;
-import org.apache.hadoop.hbase.client.HTable;
-import org.apache.hadoop.hbase.io.BatchUpdate;
-import org.apache.hadoop.io.LongWritable;
-import org.apache.hadoop.io.Text;
-import org.apache.hadoop.mapred.FileInputFormat;
-import org.apache.hadoop.mapred.JobClient;
-import org.apache.hadoop.mapred.JobConf;
-import org.apache.hadoop.mapred.MapReduceBase;
-import org.apache.hadoop.mapred.Mapper;
-import org.apache.hadoop.mapred.OutputCollector;
-import org.apache.hadoop.mapred.Reporter;
-import org.apache.hadoop.mapred.lib.NullOutputFormat;
-import org.apache.hadoop.util.Tool;
-import org.apache.hadoop.util.ToolRunner;
-
-/**
- * Class that adds the parsed line from the input to hbase
- * in the map function. Map has no emissions and job
- * has no reduce.
- */
-public class BulkImport implements Tool {
-  private static final String NAME = "BulkImport";
-  private Configuration conf;
-
-  public static class InnerMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
-    private HTable table;
-    private HBaseConfiguration HBconf;
-
-    public void map(LongWritable key, Text value,
-        OutputCollector<Text, Text> output, Reporter reporter)
-        throws IOException {
-      if ( table == null )
-        throw new IOException("table is null");
-
-      // Split input line on tab character
-      String [] splits = value.toString().split("\t");
-      if ( splits.length != 4 )
-        return;
-
-      String rowID = splits[0];
-      int timestamp = Integer.parseInt( splits[1] );
-      String colID = splits[2];
-      String cellValue = splits[3];
-
-      reporter.setStatus("Map emitting cell for row='" + rowID +
-          "', column='" + colID + "', time='" + timestamp + "'");
-
-      BatchUpdate bu = new BatchUpdate( rowID );
-      if ( timestamp > 0 )
-        bu.setTimestamp( timestamp );
-
-      bu.put(colID, cellValue.getBytes());
-      table.commit( bu );
-    }
-
-    public void configure(JobConf job) {
-      HBconf = new HBaseConfiguration(job);
-      try {
-        table = new HTable( HBconf, job.get("input.table") );
-      } catch (IOException e) {
-        // TODO Auto-generated catch block
-        e.printStackTrace();
-      }
-    }
-  }
-
-  public JobConf createSubmittableJob(String[] args) {
-    JobConf c = new JobConf(getConf(), BulkImport.class);
-    c.setJobName(NAME);
-    FileInputFormat.setInputPaths(c, new Path(args[0]));
-
-    c.set("input.table", args[1]);
-    c.setMapperClass(InnerMap.class);
-    c.setNumReduceTasks(0);
-    c.setOutputFormat(NullOutputFormat.class);
-    return c;
-  }
-
-  static int printUsage() {
-    System.err.println("Usage: " + NAME + " <input> <table_name>");
-    System.err.println("\twhere <input> is a tab-delimited text file with 4 columns.");
-    System.err.println("\t\tcolumn 1 = row ID");
-    System.err.println("\t\tcolumn 2 = timestamp (use a negative value for current time)");
-    System.err.println("\t\tcolumn 3 = column ID");
-    System.err.println("\t\tcolumn 4 = cell value");
-    return -1;
-  }
-
-  public int run(@SuppressWarnings("unused") String[] args) throws Exception {
-    // Make sure there are exactly 3 parameters left.
-    if (args.length != 2) {
-      return printUsage();
-    }
-    JobClient.runJob(createSubmittableJob(args));
-    return 0;
-  }
-
-  public Configuration getConf() {
-    return this.conf;
-  }
-
-  public void setConf(final Configuration c) {
-    this.conf = c;
-  }
-
-  public static void main(String[] args) throws Exception {
-    int errCode = ToolRunner.run(new Configuration(), new BulkImport(), args);
-    System.exit(errCode);
-  }
-}
-</pre></blockquote>
-
 */
 package org.apache.hadoop.hbase.mapreduce;
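One more hedged sketch, tied to the <code>HRegionPartitioner</code> note in the last hunk: a small helper showing how a write-heavy job might opt in. The helper name and the idea of sizing the reduce count to the current number of regions are illustrative only, and the partitioner is assumed to require that the job's output table already be configured (for example via <code>TableMapReduceUtil.initTableReducerJob</code>).

<blockquote><pre>
import org.apache.hadoop.hbase.mapreduce.HRegionPartitioner;
import org.apache.hadoop.mapreduce.Job;

public class PartitionerSetup {
  /**
   * Route each reduce output to the region that will host it so that, with
   * roughly one reducer per existing region, writes spread across the cluster.
   * Only worthwhile when the upload will not greatly change the region count;
   * otherwise keep the default partitioner, as the documentation above notes.
   */
  static void useRegionPartitioner(Job job, int currentRegionCount) {
    job.setPartitionerClass(HRegionPartitioner.class);
    job.setNumReduceTasks(currentRegionCount);
  }
}
</pre></blockquote>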