MapReduce jobs deployed to a MapReduce cluster do not by default have access
-to the HBase configuration under $HBASE_CONF_DIR nor to HBase classes.
-You could add hbase-site.xml to $HADOOP_HOME/conf and add
-hbase-X.X.X.jar to the $HADOOP_HOME/lib and copy these
-changes across your cluster but the cleanest means of adding hbase configuration
-and classes to the cluster CLASSPATH is by uncommenting
-HADOOP_CLASSPATH in $HADOOP_HOME/conf/hadoop-env.sh
-adding hbase dependencies here. For example, here is how you would amend
-hadoop-env.sh adding the
-built hbase jar, zookeeper (needed by hbase client), hbase conf, and the
-PerformanceEvaluation class from the built hbase test jar to the
-hadoop CLASSPATH:
-
-
Expand $HBASE_HOME in the above appropriately to suit your
-local environment.
-
-
After copying the above change around your cluster (and restarting), this is
-how you would run the PerformanceEvaluation MR job to put up 4 clients (Presumes
-a ready mapreduce cluster):
-
-
-
-The PerformanceEvaluation class wil be found on the CLASSPATH because you
-added $HBASE_HOME/build/test to HADOOP_CLASSPATH
-
-
-
Another possibility, if for example you do not have access to hadoop-env.sh or
-are unable to restart the hadoop cluster, is bundling the hbase jar into a mapreduce
-job jar adding it and its dependencies under the job jar lib/
-directory and the hbase conf into a job jar conf/ directory.
-
-
-
HBase can be used as a data source, {@link org.apache.hadoop.hbase.mapred.TableInputFormat TableInputFormat},
-and data sink, {@link org.apache.hadoop.hbase.mapred.TableOutputFormat TableOutputFormat}, for MapReduce jobs.
-Writing MapReduce jobs that read or write HBase, you'll probably want to subclass
-{@link org.apache.hadoop.hbase.mapred.TableMap TableMap} and/or
-{@link org.apache.hadoop.hbase.mapred.TableReduce TableReduce}. See the do-nothing
-pass-through classes {@link org.apache.hadoop.hbase.mapred.IdentityTableMap IdentityTableMap} and
-{@link org.apache.hadoop.hbase.mapred.IdentityTableReduce IdentityTableReduce} for basic usage. For a more
-involved example, see BuildTableIndex
-or review the org.apache.hadoop.hbase.mapred.TestTableMapReduce unit test.
-
-
-
Running mapreduce jobs that have hbase as source or sink, you'll need to
-specify source/sink table and column names in your configuration.
-
-
Reading from hbase, the TableInputFormat asks hbase for the list of
-regions and makes a map-per-region or mapred.map.tasks maps,
-whichever is smaller (If your job only has two maps, up mapred.map.tasks
-to a number > number of regions). Maps will run on the adjacent TaskTracker
-if you are running a TaskTracer and RegionServer per node.
-Writing, it may make sense to avoid the reduce step and write yourself back into
-hbase from inside your map. You'd do this when your job does not need the sort
-and collation that mapreduce does on the map emitted data; on insert,
-hbase 'sorts' so there is no point double-sorting (and shuffling data around
-your mapreduce cluster) unless you need to. If you do not need the reduce,
-you might just have your map emit counts of records processed just so the
-framework's report at the end of your job has meaning or set the number of
-reduces to zero and use TableOutputFormat. See example code
-below. If running the reduce step makes sense in your case, its usually better
-to have lots of reducers so load is spread across the hbase cluster.
-
-
There is also a new hbase partitioner that will run as many reducers as
-currently existing regions. The
-{@link org.apache.hadoop.hbase.mapred.HRegionPartitioner} is suitable
-when your table is large and your upload is not such that it will greatly
-alter the number of existing regions when done; other use the default
-partitioner.
-
See {@link org.apache.hadoop.hbase.mapred.RowCounter}. You should be able to run
-it by doing: % ./bin/hadoop jar hbase-X.X.X.jar. This will invoke
-the hbase MapReduce Driver class. Select 'rowcounter' from the choice of jobs
-offered. You may need to add the hbase conf directory to $HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH
-so the rowcounter gets pointed at the right hbase cluster (or, build a new jar
-with an appropriate hbase-site.xml built into your job jar).
-
-
PerformanceEvaluation
-
See org.apache.hadoop.hbase.PerformanceEvaluation from hbase src/test. It runs
-a mapreduce job to run concurrent clients reading and writing hbase.
-
-
+
See HBase and MapReduce
+in the HBase Reference Guide for documentation on running MapReduce over HBase.
*/
package org.apache.hadoop.hbase.mapred;
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java
index 11ba5ac5923..f811e210008 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java
@@ -20,144 +20,7 @@
Provides HBase MapReduce
Input/OutputFormats, a table indexing MapReduce job, and utility
-
MapReduce jobs deployed to a MapReduce cluster do not by default have access
-to the HBase configuration under $HBASE_CONF_DIR nor to HBase classes.
-You could add hbase-site.xml to
-$HADOOP_HOME/conf and add
-HBase jars to the $HADOOP_HOME/lib and copy these
-changes across your cluster (or edit conf/hadoop-env.sh and add them to the
-HADOOP_CLASSPATH variable) but this will pollute your
-hadoop install with HBase references; its also obnoxious requiring restart of
-the hadoop cluster before it'll notice your HBase additions.
-
-
As of 0.90.x, HBase will just add its dependency jars to the job
-configuration; the dependencies just need to be available on the local
-CLASSPATH. For example, to run the bundled HBase
-{@link org.apache.hadoop.hbase.mapreduce.RowCounter} mapreduce job against a table named usertable,
-type:
-
-
-$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0.jar rowcounter usertable
-
-
-Expand $HBASE_HOME and $HADOOP_HOME in the above
-appropriately to suit your local environment. The content of HADOOP_CLASSPATH
-is set to the HBase CLASSPATH via backticking the command
-${HBASE_HOME}/bin/hbase classpath.
-
-
When the above runs, internally, the HBase jar finds its zookeeper and
-guava,
-etc., dependencies on the passed
-HADOOP_CLASSPATH and adds the found jars to the mapreduce
-job configuration. See the source at
-TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job)
-for how this is done.
-
-
The above may not work if you are running your HBase from its build directory;
-i.e. you've done $ mvn test install at
-${HBASE_HOME} and you are now
-trying to use this build in your mapreduce job. If you get
-
The HBase jar also serves as a Driver for some bundled mapreduce jobs. To
-learn about the bundled mapreduce jobs run:
-
-$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar
-An example program must be given as the first argument.
-Valid program names are:
- copytable: Export a table from local cluster to peer cluster
- completebulkload: Complete a bulk data load.
- export: Write table data to HDFS.
- import: Import data written by Export.
- importtsv: Import data in TSV format.
- rowcounter: Count rows in HBase table
-
HBase can be used as a data source, {@link org.apache.hadoop.hbase.mapreduce.TableInputFormat TableInputFormat},
-and data sink, {@link org.apache.hadoop.hbase.mapreduce.TableOutputFormat TableOutputFormat}
-or {@link org.apache.hadoop.hbase.mapreduce.MultiTableOutputFormat MultiTableOutputFormat},
-for MapReduce jobs.
-Writing MapReduce jobs that read or write HBase, you'll probably want to subclass
-{@link org.apache.hadoop.hbase.mapreduce.TableMapper TableMapper} and/or
-{@link org.apache.hadoop.hbase.mapreduce.TableReducer TableReducer}. See the do-nothing
-pass-through classes {@link org.apache.hadoop.hbase.mapreduce.IdentityTableMapper IdentityTableMapper} and
-{@link org.apache.hadoop.hbase.mapreduce.IdentityTableReducer IdentityTableReducer} for basic usage. For a more
-involved example, see {@link org.apache.hadoop.hbase.mapreduce.RowCounter}
-or review the org.apache.hadoop.hbase.mapreduce.TestTableMapReduce unit test.
-
-
-
Running mapreduce jobs that have HBase as source or sink, you'll need to
-specify source/sink table and column names in your configuration.
-
-
Reading from HBase, the TableInputFormat asks HBase for the list of
-regions and makes a map-per-region or mapreduce.job.maps maps,
-whichever is smaller (If your job only has two maps, up mapreduce.job.maps
-to a number > number of regions). Maps will run on the adjacent TaskTracker
-if you are running a TaskTracer and RegionServer per node.
-Writing, it may make sense to avoid the reduce step and write yourself back into
-HBase from inside your map. You'd do this when your job does not need the sort
-and collation that mapreduce does on the map emitted data; on insert,
-HBase 'sorts' so there is no point double-sorting (and shuffling data around
-your mapreduce cluster) unless you need to. If you do not need the reduce,
-you might just have your map emit counts of records processed just so the
-framework's report at the end of your job has meaning or set the number of
-reduces to zero and use TableOutputFormat. See example code
-below. If running the reduce step makes sense in your case, its usually better
-to have lots of reducers so load is spread across the HBase cluster.
-
-
There is also a new HBase partitioner that will run as many reducers as
-currently existing regions. The
-{@link org.apache.hadoop.hbase.mapreduce.HRegionPartitioner} is suitable
-when your table is large and your upload is not such that it will greatly
-alter the number of existing regions when done; otherwise use the default
-partitioner.
-
If importing into a new table, its possible to by-pass the HBase API
-and write your content directly to the filesystem properly formatted as
-HBase data files (HFiles). Your import will run faster, perhaps an order of
-magnitude faster if not more. For more on how this mechanism works, see
-Bulk Loads
-documentation.
-
See {@link org.apache.hadoop.hbase.mapreduce.RowCounter}. This job uses
-{@link org.apache.hadoop.hbase.mapreduce.TableInputFormat TableInputFormat} and
-does a count of all rows in specified table.
-You should be able to run
-it by doing: % ./bin/hadoop jar hbase-X.X.X.jar. This will invoke
-the hbase MapReduce Driver class. Select 'rowcounter' from the choice of jobs
-offered. This will emit rowcouner 'usage'. Specify tablename, column to count
-and output directory. You may need to add the hbase conf directory to $HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH
-so the rowcounter gets pointed at the right hbase cluster (or, build a new jar
-with an appropriate hbase-site.xml built into your job jar).
-
+
See HBase and MapReduce
+in the HBase Reference Guide for documentation on running MapReduce over HBase.
*/
package org.apache.hadoop.hbase.mapreduce;
diff --git a/src/main/docbkx/book.xml b/src/main/docbkx/book.xml
index cb84c21badb..1fca2be7985 100644
--- a/src/main/docbkx/book.xml
+++ b/src/main/docbkx/book.xml
@@ -664,18 +664,76 @@ htable.put(put);
-
- HBase and MapReduce
- See
- HBase and MapReduce up in javadocs.
- Start there. Below is some additional help.
- For more information about MapReduce (i.e., the framework in general), see the Hadoop site (TODO: Need good links here --
- we used to have some but they rotted against apache hadoop).
-
- Notice to Mapreduce users of HBase 0.96.1 and above
- Some mapreduce jobs that use HBase fail to launch. The symptom is an
- exception similar to the following:
-
+
+ HBase and MapReduce
+ Apache MapReduce is a software framework used to analyze large amounts of data, and is
+ the framework used most often with Apache Hadoop. MapReduce itself is out of the
+ scope of this document. A good place to get started with MapReduce is . MapReduce version
+ 2 (MR2) is now part of YARN.
+
+ This chapter discusses specific configuration steps you need to take to use MapReduce on
+ data within HBase. In addition, it discusses other interactions and issues between HBase and
+ MapReduce jobs.
+
+ mapred and mapreduce
+ There are two mapreduce packages in HBase as in MapReduce itself: org.apache.hadoop.hbase.mapred
+ and org.apache.hadoop.hbase.mapreduce. The former uses the old-style API and the latter
+ the new style. The latter has more facilities, though you can usually find an equivalent in the older
+ package. Pick the package that goes with your MapReduce deployment. When in doubt or starting over, pick
+ org.apache.hadoop.hbase.mapreduce. In the notes below, we refer to
+ o.a.h.h.mapreduce, but substitute o.a.h.h.mapred if that is what you are using.
+
+
+
+
+
+ HBase, MapReduce, and the CLASSPATH
+ By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either
+ the HBase configuration under $HBASE_CONF_DIR or the HBase classes.
+ To give the MapReduce jobs the access they need, you could add
+ hbase-site.xml to the
+ $HADOOP_HOME/conf/ directory and add the
+ HBase JARs to the $HADOOP_HOME/lib/
+ directory, then copy these changes across your cluster, or you could edit
+ $HADOOP_HOME/conf/hadoop-env.sh and add
+ them to the HADOOP_CLASSPATH variable. However, this approach is not
+ recommended because it will pollute your Hadoop install with HBase references. It also
+ requires you to restart the Hadoop cluster before it can see the HBase classes and
+ configuration.
+ Since HBase 0.90.x, HBase adds its dependency JARs to the job configuration itself. The
+ dependencies only need to be available on the local CLASSPATH. The following example runs
+ the bundled HBase RowCounter
+ MapReduce job against a table named usertable. If you have not set
+ the environment variables expected in the command (the parts prefixed by a
+ $ sign and curly braces), you can use the actual system paths instead.
+ Be sure to use the correct version of the HBase JAR for your system. The backticks
+ (` symbols) cause the shell to execute the sub-commands, setting the
+ CLASSPATH as part of the command. This example assumes you use a BASH-compatible shell.
+ $ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0.jar rowcounter usertable
+ When the command runs, internally, the HBase JAR finds its ZooKeeper, Guava, and
+ other dependencies on the passed HADOOP_CLASSPATH
+ and adds the JARs to the MapReduce job configuration. See the source at
+ TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done.
+
+ The example may not work if you are running HBase from its build directory rather
+ than an installed location. You may see an error like the following:
+ java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper
+ If this occurs, try modifying the command as follows, so that it uses the HBase JARs
+ from the target/ directory within the build environment.
+ $ HADOOP_CLASSPATH=${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar rowcounter usertable
+
+
+ Notice to Mapreduce users of HBase 0.96.1 and above
+ Some mapreduce jobs that use HBase fail to launch. The symptom is an exception similar
+ to the following:
+
Exception in thread "main" java.lang.IllegalAccessError: class
com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass
com.google.protobuf.LiteralByteString
@@ -703,63 +761,158 @@ Exception in thread "main" java.lang.IllegalAccessError: class
at
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100)
...
-
- This is because of an optimization introduced in HBASE-9867
- that inadvertently introduced a classloader dependency.
-
- This affects both jobs using the -libjars option and
- "fat jar," those which package their runtime dependencies in a nested
- lib folder.
- In order to satisfy the new classloader requirements,
- hbase-protocol.jar must be included in Hadoop's classpath. This can be
- resolved system-wide by including a reference to the hbase-protocol.jar in
- hadoop's lib directory, via a symlink or by copying the jar into the new
- location.
- This can also be achieved on a per-job launch basis by including it
- in the HADOOP_CLASSPATH environment variable at job submission
- time. When launching jobs that package their dependencies, all three of the
- following job launching commands satisfy this requirement:
-
-$ HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
-$ HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
-$ HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass
-
- For jars that do not package their dependencies, the following command
- structure is necessary:
-
-$ HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',') ...
-
- See also HBASE-10304
- for further discussion of this issue.
-
-
- Map-Task Splitting
-
- The Default HBase MapReduce Splitter
- When TableInputFormat
- is used to source an HBase table in a MapReduce job,
- its splitter will make a map task for each region of the table.
- Thus, if there are 100 regions in the table, there will be
- 100 map-tasks for the job - regardless of how many column families are selected in the Scan.
+
+ This is caused by an optimization introduced in HBASE-9867 that
+ inadvertently introduced a classloader dependency.
+ This affects both jobs using the -libjars option and "fat jar" jobs, those
+ which package their runtime dependencies in a nested lib folder.
+ In order to satisfy the new classloader requirements, hbase-protocol.jar must be
+ included in Hadoop's classpath. See for current recommendations for resolving
+ classpath errors. The following is included for historical purposes.
+ This can be resolved system-wide by including a reference to the hbase-protocol.jar in
+ hadoop's lib directory, via a symlink or by copying the jar into the new location.
+ This can also be achieved on a per-job launch basis by including it in the
+ HADOOP_CLASSPATH environment variable at job submission time. When
+ launching jobs that package their dependencies, all three of the following job launching
+ commands satisfy this requirement:
+
+$ HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
+$ HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
+$ HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass
+
+ For jars that do not package their dependencies, the following command structure is
+ necessary:
+
+$ HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',') ...
+
+ See also HBASE-10304 for
+ further discussion of this issue.
+
-
- Custom Splitters
- For those interested in implementing custom splitters, see the method getSplits in
- TableInputFormatBase.
- That is where the logic for map-task assignment resides.
-
+
+
+
+ Bundled HBase MapReduce Jobs
+ The HBase JAR also serves as a Driver for some bundled MapReduce jobs. To learn about
+ the bundled MapReduce jobs, run the following command.
+
+ $ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar
+An example program must be given as the first argument.
+Valid program names are:
+ copytable: Export a table from local cluster to peer cluster
+ completebulkload: Complete a bulk data load.
+ export: Write table data to HDFS.
+ import: Import data written by Export.
+ importtsv: Import data in TSV format.
+ rowcounter: Count rows in HBase table
+
+ Each of the valid program names is a bundled MapReduce job. To run one of the jobs,
+ model your command after the following example.
+ $ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar rowcounter myTable
-
-
- HBase MapReduce Examples
-
- HBase MapReduce Read Example
- The following is an example of using HBase as a MapReduce source in read-only manner. Specifically,
- there is a Mapper instance but no Reducer, and nothing is being emitted from the Mapper. There job would be defined
- as follows...
-
+
+
+ HBase as a MapReduce Job Data Source and Data Sink
+ HBase can be used as a data source, TableInputFormat,
+ and data sink, TableOutputFormat
+ or MultiTableOutputFormat,
+ for MapReduce jobs. When writing MapReduce jobs that read or write HBase, it is advisable to
+ subclass TableMapper
+ and/or TableReducer.
+ See the do-nothing pass-through classes IdentityTableMapper
+ and IdentityTableReducer
+ for basic usage. For a more involved example, see RowCounter
+ or review the org.apache.hadoop.hbase.mapreduce.TestTableMapReduce unit test.
+ If you run MapReduce jobs that use HBase as source or sink, you need to specify source and
+ sink table and column names in your configuration.
+
+ When you read from HBase, the TableInputFormat requests the list of regions
+ from HBase and makes a map task per region, or mapreduce.job.maps map
+ tasks, whichever is smaller. If your job only has two maps,
+ raise mapreduce.job.maps to a number greater than the number of regions. Maps
+ will run on the adjacent TaskTracker if you are running a TaskTracker and RegionServer per
+ node. When writing to HBase, it may make sense to avoid the Reduce step and write back into
+ HBase from within your map. This approach works when your job does not need the sort and
+ collation that MapReduce does on the map-emitted data. On insert, HBase 'sorts' so there is
+ no point double-sorting (and shuffling data around your MapReduce cluster) unless you need
+ to. If you do not need the Reduce, your map might emit counts of records processed for
+ reporting at the end of the job, or set the number of Reduces to zero and use
+ TableOutputFormat. If running the Reduce step makes sense in your case, you should typically
+ use multiple reducers so that load is spread across the HBase cluster.
+
+ A new HBase partitioner, the HRegionPartitioner,
+ can run as many reducers as there are existing regions. The HRegionPartitioner is suitable
+ when your table is large and your upload will not greatly alter the number of existing
+ regions upon completion. Otherwise use the default partitioner.
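+ As a minimal sketch (not a complete job definition), assuming a hypothetical sink table
+ named myTable and a TableReducer subclass of your own, the partitioner can be passed
+ through TableMapReduceUtil when the reducer is initialized:
+Configuration config = HBaseConfiguration.create();
+Job job = new Job(config, "ExamplePartitionedWrite");
+job.setJarByClass(MyPartitionedWriteJob.class);  // hypothetical class containing your reducer
+// Passing HRegionPartitioner.class instead of null (the default partitioner) routes
+// each reducer's input to a single region of the sink table.
+TableMapReduceUtil.initTableReducerJob(
+  "myTable",                 // hypothetical sink table
+  MyTableReducer.class,      // your TableReducer subclass
+  job,
+  HRegionPartitioner.class);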
+
+
+
+ Writing HFiles Directly During Bulk Import
+ If you are importing into a new table, you can bypass the HBase API and write your
+ content directly to the filesystem, formatted into HBase data files (HFiles). Your import
+ will run faster, perhaps an order of magnitude faster. For more on how this mechanism works,
+ see .
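+ As a hedged sketch of the job setup (assuming a pre-created sink table named myTable,
+ a hypothetical mapper that emits ImmutableBytesWritable/Put pairs, and an output path of
+ your choosing), HFileOutputFormat.configureIncrementalLoad wires up the reducer,
+ partitioner, and total ordering needed to produce region-aligned HFiles:
+Configuration config = HBaseConfiguration.create();
+Job job = new Job(config, "ExampleBulkImport");
+job.setJarByClass(MyBulkImportJob.class);   // hypothetical class containing the mapper
+job.setMapperClass(MyHFileMapper.class);    // hypothetical mapper emitting ImmutableBytesWritable/Put
+job.setMapOutputKeyClass(ImmutableBytesWritable.class);
+job.setMapOutputValueClass(Put.class);
+// The table must already exist; its region boundaries drive the total-order partitioning.
+HTable table = new HTable(config, "myTable");
+HFileOutputFormat.configureIncrementalLoad(job, table);
+FileOutputFormat.setOutputPath(job, new Path("/tmp/bulkload-output"));
+// After the job completes, load the generated HFiles with the completebulkload tool
+// (LoadIncrementalHFiles), pointing it at the same output directory and table.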
+
+
+
+ RowCounter Example
+ The included RowCounter
+ MapReduce job uses TableInputFormat and does a count of all rows in the specified
+ table. To run it, use the following command:
+ $ ./bin/hadoop jar hbase-X.X.X.jar
+ This will
+ invoke the HBase MapReduce Driver class. Select rowcounter from the choice of jobs
+ offered. This will print rowcounter usage advice to standard output. Specify the tablename,
+ column to count, and output
+ directory. If you have classpath errors, see .
+
+
+
+ Map-Task Splitting
+
+ The Default HBase MapReduce Splitter
+ When TableInputFormat
+ is used to source an HBase table in a MapReduce job, its splitter will make a map task for
+ each region of the table. Thus, if there are 100 regions in the table, there will be 100
+ map-tasks for the job - regardless of how many column families are selected in the
+ Scan.
+
+
+ Custom Splitters
+ For those interested in implementing custom splitters, see the method
+ getSplits in TableInputFormatBase.
+ That is where the logic for map-task assignment resides.
+
+
+
+ HBase MapReduce Examples
+
+ HBase MapReduce Read Example
+ The following is an example of using HBase as a MapReduce source in a read-only manner.
+ Specifically, there is a Mapper instance but no Reducer, and nothing is being emitted from
+ the Mapper. The job would be defined as follows...
+
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class); // class that contains mapper
@@ -784,8 +937,9 @@ if (!b) {
throw new IOException("error with job!");
}
- ...and the mapper instance would extend TableMapper...
-
+ ...and the mapper instance would extend TableMapper...
+
public static class MyMapper extends TableMapper<Text, Text> {
public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
@@ -793,13 +947,13 @@ public static class MyMapper extends TableMapper<Text, Text> {
}
}
-
-
-
- HBase MapReduce Read/Write Example
- The following is an example of using HBase both as a source and as a sink with MapReduce.
- This example will simply copy data from one table to another.
-
+
+
+ HBase MapReduce Read/Write Example
+ The following is an example of using HBase both as a source and as a sink with
+ MapReduce. This example will simply copy data from one table to another.
+
Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleReadWrite");
job.setJarByClass(MyReadWriteJob.class); // class that contains mapper
@@ -827,15 +981,18 @@ if (!b) {
throw new IOException("error with job!");
}
- An explanation is required of what TableMapReduceUtil is doing, especially with the reducer.
- TableOutputFormat is being used
- as the outputFormat class, and several parameters are being set on the config (e.g., TableOutputFormat.OUTPUT_TABLE), as
- well as setting the reducer output key to ImmutableBytesWritable and reducer value to Writable.
- These could be set by the programmer on the job and conf, but TableMapReduceUtil tries to make things easier.
- The following is the example mapper, which will create a Put and matching the input Result
- and emit it. Note: this is what the CopyTable utility does.
-
-
+ An explanation is required of what TableMapReduceUtil is doing,
+ especially with the reducer. TableOutputFormat
+ is being used as the outputFormat class, and several parameters are being set on the
+ config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key
+ to ImmutableBytesWritable and reducer value to
+ Writable. These could be set by the programmer on the job and
+ conf, but TableMapReduceUtil tries to make things easier.
+ The following is the example mapper, which will create a Put
+ matching the input Result and emit it. Note: this is what the
+ CopyTable utility does.
+
public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> {
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
@@ -852,23 +1009,24 @@ public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put&
}
}
- There isn't actually a reducer step, so TableOutputFormat takes care of sending the Put
- to the target table.
-
- This is just an example, developers could choose not to use TableOutputFormat and connect to the
- target table themselves.
-
-
-
- HBase MapReduce Read/Write Example With Multi-Table Output
- TODO: example for MultiTableOutputFormat.
-
-
-
- HBase MapReduce Summary to HBase Example
- The following example uses HBase as a MapReduce source and sink with a summarization step. This example will
- count the number of distinct instances of a value in a table and write those summarized counts in another table.
-
+ There isn't actually a reducer step, so TableOutputFormat takes
+ care of sending the Put to the target table.
+ This is just an example; developers could choose not to use
+ TableOutputFormat and connect to the target table themselves.
+
+
+
+ HBase MapReduce Read/Write Example With Multi-Table Output
+ TODO: example for MultiTableOutputFormat.
+
+
+ HBase MapReduce Summary to HBase Example
+ The following example uses HBase as a MapReduce source and sink with a summarization
+ step. This example will count the number of distinct instances of a value in a table and
+ write those summarized counts in another table.
+
Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleSummary");
job.setJarByClass(MySummaryJob.class); // class that contains mapper and reducer
@@ -896,9 +1054,10 @@ if (!b) {
throw new IOException("error with job!");
}
- In this example mapper a column with a String-value is chosen as the value to summarize upon.
- This value is used as the key to emit from the mapper, and an IntWritable represents an instance counter.
-
+ In this example mapper, a column with a string value is chosen as the value to summarize
+ upon. This value is used as the key to emit from the mapper, and an
+ IntWritable represents an instance counter.
+
public static class MyMapper extends TableMapper<Text, IntWritable> {
public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR1 = "attr1".getBytes();
@@ -914,8 +1073,9 @@ public static class MyMapper extends TableMapper<Text, IntWritable> {
}
}
- In the reducer, the "ones" are counted (just like any other MR example that does this), and then emits a Put.
-
+ In the reducer, the "ones" are counted (just like any other MR example that does this),
+ and then a Put is emitted.
+
public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
public static final byte[] CF = "cf".getBytes();
public static final byte[] COUNT = "count".getBytes();
@@ -932,14 +1092,15 @@ public static class MyTableReducer extends TableReducer<Text, IntWritable, Im
}
}
-
-
-
- HBase MapReduce Summary to File Example
- This very similar to the summary example above, with exception that this is using HBase as a MapReduce source
- but HDFS as the sink. The differences are in the job setup and in the reducer. The mapper remains the same.
-
-
+
+
+
+ HBase MapReduce Summary to File Example
+ This is very similar to the summary example above, except that this example uses
+ HBase as a MapReduce source but HDFS as the sink. The differences are in the job setup and
+ in the reducer. The mapper remains the same.
+
Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleSummaryToFile");
job.setJarByClass(MySummaryFileJob.class); // class that contains mapper and reducer
@@ -965,9 +1126,10 @@ if (!b) {
throw new IOException("error with job!");
}
- As stated above, the previous Mapper can run unchanged with this example.
- As for the Reducer, it is a "generic" Reducer instead of extending TableMapper and emitting Puts.
-
+ As stated above, the previous Mapper can run unchanged with this example. As for the
+ Reducer, it is a "generic" Reducer instead of extending TableReducer and emitting
+ Puts.
+
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
@@ -979,33 +1141,35 @@ if (!b) {
}
}
-
-
- HBase MapReduce Summary to HBase Without Reducer
- It is also possible to perform summaries without a reducer - if you use HBase as the reducer.
-
- An HBase target table would need to exist for the job summary. The HTable method incrementColumnValue
- would be used to atomically increment values. From a performance perspective, it might make sense to keep a Map
- of values with their values to be incremeneted for each map-task, and make one update per key at during the
- cleanup method of the mapper. However, your milage may vary depending on the number of rows to be processed and
- unique keys.
-
- In the end, the summary results are in HBase.
-
-
-
- HBase MapReduce Summary to RDBMS
- Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases, it is possible
- to generate summaries directly to an RDBMS via a custom reducer. The setup method
- can connect to an RDBMS (the connection information can be passed via custom parameters in the context) and the
- cleanup method can close the connection.
-
- It is critical to understand that number of reducers for the job affects the summarization implementation, and
- you'll have to design this into your reducer. Specifically, whether it is designed to run as a singleton (one reducer)
- or multiple reducers. Neither is right or wrong, it depends on your use-case. Recognize that the more reducers that
- are assigned to the job, the more simultaneous connections to the RDBMS will be created - this will scale, but only to a point.
-
-
+
+
+ HBase MapReduce Summary to HBase Without Reducer
+ It is also possible to perform summaries without a reducer - if you use HBase as the
+ reducer.
+ An HBase target table would need to exist for the job summary. The HTable method
+ incrementColumnValue would be used to atomically increment values. From a
+ performance perspective, it might make sense to keep a Map of keys with their values to
+ be incremented for each map-task, and make one update per key during the
+ cleanup method of the mapper. However, your mileage may vary depending on the
+ number of rows to be processed and the number of unique keys.
+ In the end, the summary results are in HBase.
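+ A minimal sketch of this pattern, assuming a hypothetical pre-created summary table named
+ summaryTable with a column family cf, and a source attribute attr1 as in the earlier
+ summary example; counts are buffered in memory per map task and flushed with one
+ incrementColumnValue call per distinct key in cleanup:
+public static class MySummingMapper extends TableMapper<Text, LongWritable> {
+  public static final byte[] CF = "cf".getBytes();
+  public static final byte[] COUNT = "count".getBytes();
+  private HTable summaryTable;
+  private Map<String, Long> counts = new HashMap<String, Long>();
+
+  public void setup(Context context) throws IOException {
+    // Hypothetical summary table; it must already exist.
+    summaryTable = new HTable(context.getConfiguration(), "summaryTable");
+  }
+
+  public void map(ImmutableBytesWritable row, Result value, Context context)
+      throws IOException, InterruptedException {
+    // Buffer counts in memory for this map task instead of writing per input row.
+    String key = Bytes.toString(value.getValue(CF, "attr1".getBytes()));
+    Long current = counts.get(key);
+    counts.put(key, current == null ? 1L : current + 1L);
+  }
+
+  public void cleanup(Context context) throws IOException {
+    // One atomic increment per distinct key, rather than one per input row.
+    for (Map.Entry<String, Long> e : counts.entrySet()) {
+      summaryTable.incrementColumnValue(Bytes.toBytes(e.getKey()), CF, COUNT, e.getValue());
+    }
+    summaryTable.close();
+  }
+}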
+
+
+ HBase MapReduce Summary to RDBMS
+ Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases,
+ it is possible to generate summaries directly to an RDBMS via a custom reducer. The
+ setup method can connect to an RDBMS (the connection information can be
+ passed via custom parameters in the context) and the cleanup method can close the
+ connection.
+ It is critical to understand that the number of reducers for the job affects the
+ summarization implementation, and you'll have to design this into your reducer.
+ Specifically, whether it is designed to run as a singleton (one reducer) or multiple
+ reducers. Neither is right or wrong, it depends on your use-case. Recognize that the more
+ reducers that are assigned to the job, the more simultaneous connections to the RDBMS will
+ be created - this will scale, but only to a point.
+
public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private Connection c = null;
@@ -1025,18 +1189,18 @@ if (!b) {
}
- In the end, the summary results are written to your RDBMS table/s.
-
-
+ In the end, the summary results are written to your RDBMS table(s).
+
-
-
- Accessing Other HBase Tables in a MapReduce Job
- Although the framework currently allows one HBase table as input to a
- MapReduce job, other HBase tables can
- be accessed as lookup tables, etc., in a
- MapReduce job via creating an HTable instance in the setup method of the Mapper.
- public class MyMapper extends TableMapper<Text, LongWritable> {
+
+
+
+ Accessing Other HBase Tables in a MapReduce Job
+ Although the framework currently allows one HBase table as input to a MapReduce job,
+ other HBase tables can be accessed as lookup tables, etc., in a MapReduce job by creating
+ an HTable instance in the setup method of the Mapper.
+ public class MyMapper extends TableMapper<Text, LongWritable> {
private HTable myOtherTable;
public void setup(Context context) {
@@ -1049,20 +1213,19 @@ if (!b) {
}
-
+
+
+
+ Speculative Execution
+ It is generally advisable to turn off speculative execution for MapReduce jobs that use
+ HBase as a source. This can either be done on a per-Job basis through properties, or on the
+ entire cluster. Especially for longer running jobs, speculative execution will create
+ duplicate map-tasks which will double-write your data to HBase; this is probably not what
+ you want.
+ See for more information.
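+ A minimal per-job sketch, assuming the MR2 property names; on older MapReduce versions the
+ equivalent properties are mapred.map.tasks.speculative.execution and
+ mapred.reduce.tasks.speculative.execution:
+Configuration config = HBaseConfiguration.create();
+// Turn off speculative map and reduce tasks for this job only, before creating the Job.
+config.setBoolean("mapreduce.map.speculative", false);
+config.setBoolean("mapreduce.reduce.speculative", false);
+Job job = new Job(config, "ExampleNoSpeculation");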
-
- Speculative Execution
- It is generally advisable to turn off speculative execution for
- MapReduce jobs that use HBase as a source. This can either be done on a
- per-Job basis through properties, on on the entire cluster. Especially
- for longer running jobs, speculative execution will create duplicate
- map-tasks which will double-write your data to HBase; this is probably
- not what you want.
-
- See for more information.
-
-