From ef995efb1aac7fee1d23639128ecb1d85f5a3bec Mon Sep 17 00:00:00 2001 From: Michael Stack Date: Tue, 27 May 2014 15:13:23 -0700 Subject: [PATCH] HBASE-8807 HBase MapReduce Job-Launch Documentation Misplaced (Misty Stanley-Jones) --- .../hadoop/hbase/mapred/package-info.java | 101 +--- .../hadoop/hbase/mapreduce/package-info.java | 141 +---- src/main/docbkx/book.xml | 501 ++++++++++++------ 3 files changed, 336 insertions(+), 407 deletions(-) diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/mapred/package-info.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/mapred/package-info.java index 5a51d6c76b0..8ff35a31d83 100644 --- a/hbase-server/src/main/java/org/apache/hadoop/hbase/mapred/package-info.java +++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/mapred/package-info.java @@ -20,104 +20,7 @@ Provides HBase MapReduce Input/OutputFormats, a table indexing MapReduce job, and utility -

Table of Contents

- - -

HBase, MapReduce and the CLASSPATH

- -

MapReduce jobs deployed to a MapReduce cluster do not by default have access -to the HBase configuration under $HBASE_CONF_DIR nor to HBase classes. -You could add hbase-site.xml to $HADOOP_HOME/conf and add -hbase-X.X.X.jar to the $HADOOP_HOME/lib and copy these -changes across your cluster but the cleanest means of adding hbase configuration -and classes to the cluster CLASSPATH is by uncommenting -HADOOP_CLASSPATH in $HADOOP_HOME/conf/hadoop-env.sh -adding hbase dependencies here. For example, here is how you would amend -hadoop-env.sh adding the -built hbase jar, zookeeper (needed by hbase client), hbase conf, and the -PerformanceEvaluation class from the built hbase test jar to the -hadoop CLASSPATH: - -

# Extra Java CLASSPATH elements. Optional.
-# export HADOOP_CLASSPATH=
-export HADOOP_CLASSPATH=$HBASE_HOME/build/hbase-X.X.X.jar:$HBASE_HOME/build/hbase-X.X.X-test.jar:$HBASE_HOME/conf:${HBASE_HOME}/lib/zookeeper-X.X.X.jar
- -

Expand $HBASE_HOME in the above appropriately to suit your -local environment.

- -

After copying the above change around your cluster (and restarting), this is -how you would run the PerformanceEvaluation MR job to put up 4 clients (Presumes -a ready mapreduce cluster): - -

$HADOOP_HOME/bin/hadoop org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 4
- -The PerformanceEvaluation class wil be found on the CLASSPATH because you -added $HBASE_HOME/build/test to HADOOP_CLASSPATH -

- -

Another possibility, if for example you do not have access to hadoop-env.sh or -are unable to restart the hadoop cluster, is bundling the hbase jar into a mapreduce -job jar adding it and its dependencies under the job jar lib/ -directory and the hbase conf into a job jar conf/ directory. - - -

HBase as MapReduce job data source and sink

- -

HBase can be used as a data source, {@link org.apache.hadoop.hbase.mapred.TableInputFormat TableInputFormat}, -and data sink, {@link org.apache.hadoop.hbase.mapred.TableOutputFormat TableOutputFormat}, for MapReduce jobs. -Writing MapReduce jobs that read or write HBase, you'll probably want to subclass -{@link org.apache.hadoop.hbase.mapred.TableMap TableMap} and/or -{@link org.apache.hadoop.hbase.mapred.TableReduce TableReduce}. See the do-nothing -pass-through classes {@link org.apache.hadoop.hbase.mapred.IdentityTableMap IdentityTableMap} and -{@link org.apache.hadoop.hbase.mapred.IdentityTableReduce IdentityTableReduce} for basic usage. For a more -involved example, see BuildTableIndex -or review the org.apache.hadoop.hbase.mapred.TestTableMapReduce unit test. -

- -

Running mapreduce jobs that have hbase as source or sink, you'll need to -specify source/sink table and column names in your configuration.

- -

Reading from hbase, the TableInputFormat asks hbase for the list of -regions and makes a map-per-region or mapred.map.tasks maps, -whichever is smaller (If your job only has two maps, up mapred.map.tasks -to a number > number of regions). Maps will run on the adjacent TaskTracker -if you are running a TaskTracer and RegionServer per node. -Writing, it may make sense to avoid the reduce step and write yourself back into -hbase from inside your map. You'd do this when your job does not need the sort -and collation that mapreduce does on the map emitted data; on insert, -hbase 'sorts' so there is no point double-sorting (and shuffling data around -your mapreduce cluster) unless you need to. If you do not need the reduce, -you might just have your map emit counts of records processed just so the -framework's report at the end of your job has meaning or set the number of -reduces to zero and use TableOutputFormat. See example code -below. If running the reduce step makes sense in your case, its usually better -to have lots of reducers so load is spread across the hbase cluster.

- -

There is also a new hbase partitioner that will run as many reducers as -currently existing regions. The -{@link org.apache.hadoop.hbase.mapred.HRegionPartitioner} is suitable -when your table is large and your upload is not such that it will greatly -alter the number of existing regions when done; other use the default -partitioner. -

- -

Example Code

-

Sample Row Counter

-

See {@link org.apache.hadoop.hbase.mapred.RowCounter}. You should be able to run -it by doing: % ./bin/hadoop jar hbase-X.X.X.jar. This will invoke -the hbase MapReduce Driver class. Select 'rowcounter' from the choice of jobs -offered. You may need to add the hbase conf directory to $HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH -so the rowcounter gets pointed at the right hbase cluster (or, build a new jar -with an appropriate hbase-site.xml built into your job jar). -

-

PerformanceEvaluation

-

See org.apache.hadoop.hbase.PerformanceEvaluation from hbase src/test. It runs -a mapreduce job to run concurrent clients reading and writing hbase. -

- +

See HBase and MapReduce +in the HBase Reference Guide for mapreduce over hbase documentation. */ package org.apache.hadoop.hbase.mapred; diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java index 11ba5ac5923..f811e210008 100644 --- a/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java +++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java @@ -20,144 +20,7 @@ Provides HBase MapReduce Input/OutputFormats, a table indexing MapReduce job, and utility -

Table of Contents

- - -

HBase, MapReduce and the CLASSPATH

- -

MapReduce jobs deployed to a MapReduce cluster do not by default have access -to the HBase configuration under $HBASE_CONF_DIR nor to HBase classes. -You could add hbase-site.xml to -$HADOOP_HOME/conf and add -HBase jars to the $HADOOP_HOME/lib and copy these -changes across your cluster (or edit conf/hadoop-env.sh and add them to the -HADOOP_CLASSPATH variable) but this will pollute your -hadoop install with HBase references; its also obnoxious requiring restart of -the hadoop cluster before it'll notice your HBase additions.

- -

As of 0.90.x, HBase will just add its dependency jars to the job -configuration; the dependencies just need to be available on the local -CLASSPATH. For example, to run the bundled HBase -{@link org.apache.hadoop.hbase.mapreduce.RowCounter} mapreduce job against a table named usertable, -type: - -

-$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0.jar rowcounter usertable
-
- -Expand $HBASE_HOME and $HADOOP_HOME in the above -appropriately to suit your local environment. The content of HADOOP_CLASSPATH -is set to the HBase CLASSPATH via backticking the command -${HBASE_HOME}/bin/hbase classpath. - -

When the above runs, internally, the HBase jar finds its zookeeper and -guava, -etc., dependencies on the passed -HADOOP_CLASSPATH and adds the found jars to the mapreduce -job configuration. See the source at -TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) -for how this is done. -

-

The above may not work if you are running your HBase from its build directory; -i.e. you've done $ mvn test install at -${HBASE_HOME} and you are now -trying to use this build in your mapreduce job. If you get -

java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper
-...
-
-exception thrown, try doing the following: -
-$ HADOOP_CLASSPATH=${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar rowcounter usertable
-
-Notice how we preface the backtick invocation setting -HADOOP_CLASSPATH with reference to the built HBase jar over in -the target directory. -

- -

Bundled HBase MapReduce Jobs

-

The HBase jar also serves as a Driver for some bundled mapreduce jobs. To -learn about the bundled mapreduce jobs run: -

-$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar
-An example program must be given as the first argument.
-Valid program names are:
-  copytable: Export a table from local cluster to peer cluster
-  completebulkload: Complete a bulk data load.
-  export: Write table data to HDFS.
-  import: Import data written by Export.
-  importtsv: Import data in TSV format.
-  rowcounter: Count rows in HBase table
-
- -

HBase as MapReduce job data source and sink

- -

HBase can be used as a data source, {@link org.apache.hadoop.hbase.mapreduce.TableInputFormat TableInputFormat}, -and data sink, {@link org.apache.hadoop.hbase.mapreduce.TableOutputFormat TableOutputFormat} -or {@link org.apache.hadoop.hbase.mapreduce.MultiTableOutputFormat MultiTableOutputFormat}, -for MapReduce jobs. -Writing MapReduce jobs that read or write HBase, you'll probably want to subclass -{@link org.apache.hadoop.hbase.mapreduce.TableMapper TableMapper} and/or -{@link org.apache.hadoop.hbase.mapreduce.TableReducer TableReducer}. See the do-nothing -pass-through classes {@link org.apache.hadoop.hbase.mapreduce.IdentityTableMapper IdentityTableMapper} and -{@link org.apache.hadoop.hbase.mapreduce.IdentityTableReducer IdentityTableReducer} for basic usage. For a more -involved example, see {@link org.apache.hadoop.hbase.mapreduce.RowCounter} -or review the org.apache.hadoop.hbase.mapreduce.TestTableMapReduce unit test. -

- -

Running mapreduce jobs that have HBase as source or sink, you'll need to -specify source/sink table and column names in your configuration.

- -

Reading from HBase, the TableInputFormat asks HBase for the list of -regions and makes a map-per-region or mapreduce.job.maps maps, -whichever is smaller (If your job only has two maps, up mapreduce.job.maps -to a number > number of regions). Maps will run on the adjacent TaskTracker -if you are running a TaskTracer and RegionServer per node. -Writing, it may make sense to avoid the reduce step and write yourself back into -HBase from inside your map. You'd do this when your job does not need the sort -and collation that mapreduce does on the map emitted data; on insert, -HBase 'sorts' so there is no point double-sorting (and shuffling data around -your mapreduce cluster) unless you need to. If you do not need the reduce, -you might just have your map emit counts of records processed just so the -framework's report at the end of your job has meaning or set the number of -reduces to zero and use TableOutputFormat. See example code -below. If running the reduce step makes sense in your case, its usually better -to have lots of reducers so load is spread across the HBase cluster.

- -

There is also a new HBase partitioner that will run as many reducers as -currently existing regions. The -{@link org.apache.hadoop.hbase.mapreduce.HRegionPartitioner} is suitable -when your table is large and your upload is not such that it will greatly -alter the number of existing regions when done; otherwise use the default -partitioner. -

- -

Bulk import writing HFiles directly

-

If importing into a new table, its possible to by-pass the HBase API -and write your content directly to the filesystem properly formatted as -HBase data files (HFiles). Your import will run faster, perhaps an order of -magnitude faster if not more. For more on how this mechanism works, see -Bulk Loads -documentation. -

- -

Example Code

-

Sample Row Counter

-

See {@link org.apache.hadoop.hbase.mapreduce.RowCounter}. This job uses -{@link org.apache.hadoop.hbase.mapreduce.TableInputFormat TableInputFormat} and -does a count of all rows in specified table. -You should be able to run -it by doing: % ./bin/hadoop jar hbase-X.X.X.jar. This will invoke -the hbase MapReduce Driver class. Select 'rowcounter' from the choice of jobs -offered. This will emit rowcouner 'usage'. Specify tablename, column to count -and output directory. You may need to add the hbase conf directory to $HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH -so the rowcounter gets pointed at the right hbase cluster (or, build a new jar -with an appropriate hbase-site.xml built into your job jar). -

+

See HBase and MapReduce +in the HBase Reference Guide for mapreduce over hbase documentation. */ package org.apache.hadoop.hbase.mapreduce; diff --git a/src/main/docbkx/book.xml b/src/main/docbkx/book.xml index cb84c21badb..1fca2be7985 100644 --- a/src/main/docbkx/book.xml +++ b/src/main/docbkx/book.xml @@ -664,18 +664,76 @@ htable.put(put); - - HBase and MapReduce - See - HBase and MapReduce up in javadocs. - Start there. Below is some additional help. - For more information about MapReduce (i.e., the framework in general), see the Hadoop site (TODO: Need good links here -- - we used to have some but they rotted against apache hadoop). - - Notice to Mapreduce users of HBase 0.96.1 and above - Some mapreduce jobs that use HBase fail to launch. The symptom is an - exception similar to the following: - + + HBase and MapReduce + Apache MapReduce is a software framework used to analyze large amounts of data, and is + the framework used most often with Apache Hadoop. MapReduce itself is out of the + scope of this document. A good place to get started with MapReduce is . MapReduce version + 2 (MR2)is now part of YARN. + + This chapter discusses specific configuration steps you need to take to use MapReduce on + data within HBase. In addition, it discusses other interactions and issues between HBase and + MapReduce jobs. + + mapred and mapreduce + There are two mapreduce packages in HBase as in MapReduce itself: org.apache.hadoop.hbase.mapred + and org.apache.hadoop.hbase.mapreduce. The former does old-style API and the latter + the new style. The latter has more facility though you can usually find an equivalent in the older + package. Pick the package that goes with your mapreduce deploy. When in doubt or starting over, pick the + org.apache.hadoop.hbase.mapreduce. In the notes below, we refer to + o.a.h.h.mapreduce but replace with the o.a.h.h.mapred if that is what you are using. + + + + +

+ HBase, MapReduce, and the CLASSPATH + By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either + the HBase configuration under $HBASE_CONF_DIR or the HBase classes. + To give the MapReduce jobs the access they need, you could add + hbase-site.xml to the + $HADOOP_HOME/conf/ directory and add the + HBase JARs to the $HADOOP_HOME/lib/ + directory, then copy these changes across your cluster, or you could edit + $HADOOP_HOME/conf/hadoop-env.sh and add + them to the HADOOP_CLASSPATH variable. However, this approach is not + recommended because it will pollute your Hadoop install with HBase references. It also + requires you to restart the Hadoop cluster before Hadoop can use the HBase data. + Since HBase 0.90.x, HBase adds its dependency JARs to the job configuration itself. The + dependencies only need to be available on the local CLASSPATH. The following example runs + the bundled HBase RowCounter + MapReduce job against a table named usertable. If you have not set + the environment variables expected in the command (the parts prefixed by a + $ sign and curly braces), you can use the actual system paths instead. + Be sure to use the correct version of the HBase JAR for your system. The backticks + (` symbols) cause the shell to execute the sub-commands, setting the + CLASSPATH as part of the command. This example assumes you use a BASH-compatible shell. + $ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0.jar rowcounter usertable + When the command runs, internally, the HBase JAR finds its zookeeper, guava, + and other dependencies on the passed HADOOP_CLASSPATH + and adds the JARs to the MapReduce job configuration. See the source at + TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done. + + The example may not work if you are running HBase from its build directory rather + than an installed location. You may see an error like the following: + java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper + If this occurs, try modifying the command as follows, so that it uses the HBase JARs + from the target/ directory within the build environment. + $ HADOOP_CLASSPATH=${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar rowcounter usertable + + + Notice to Mapreduce users of HBase 0.96.1 and above + Some mapreduce jobs that use HBase fail to launch. The symptom is an exception similar + to the following: + 
Exception in thread "main" java.lang.IllegalAccessError: class com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass com.google.protobuf.LiteralByteString 
@@ -703,63 +761,158 @@ Exception in thread "main" java.lang.IllegalAccessError: class 
 at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100) ... - - This is because of an optimization introduced in HBASE-9867 - that inadvertently introduced a classloader dependency. - - This affects both jobs using the -libjars option and - "fat jar," those which package their runtime dependencies in a nested - lib folder.
- In order to satisfy the new classloader requirements, - hbase-protocol.jar must be included in Hadoop's classpath. This can be - resolved system-wide by including a reference to the hbase-protocol.jar in - hadoop's lib directory, via a symlink or by copying the jar into the new - location. - This can also be achieved on a per-job launch basis by including it - in the HADOOP_CLASSPATH environment variable at job submission - time. When launching jobs that package their dependencies, all three of the - following job launching commands satisfy this requirement: - -$ HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass -$ HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass -$ HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass - - For jars that do not package their dependencies, the following command - structure is necessary: - -$ HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',') ... - - See also HBASE-10304 - for further discussion of this issue. - -
- Map-Task Splitting -
- The Default HBase MapReduce Splitter - When TableInputFormat - is used to source an HBase table in a MapReduce job, - its splitter will make a map task for each region of the table. - Thus, if there are 100 regions in the table, there will be - 100 map-tasks for the job - regardless of how many column families are selected in the Scan. + + This is caused by an optimization introduced in HBASE-9867 that + inadvertently introduced a classloader dependency. + This affects both jobs using the -libjars option and "fat jar," those + which package their runtime dependencies in a nested lib folder. + In order to satisfy the new classloader requirements, hbase-protocol.jar must be + included in Hadoop's classpath. See for current recommendations for resolving + classpath errors. The following is included for historical purposes. + This can be resolved system-wide by including a reference to the hbase-protocol.jar in + hadoop's lib directory, via a symlink or by copying the jar into the new location. + This can also be achieved on a per-job launch basis by including it in the + HADOOP_CLASSPATH environment variable at job submission time. When + launching jobs that package their dependencies, all three of the following job launching + commands satisfy this requirement: + +$ HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass +$ HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass +$ HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass + + For jars that do not package their dependencies, the following command structure is + necessary: + +$ HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',') ... + + See also HBASE-10304 for + further discussion of this issue. +
-
- Custom Splitters - For those interested in implementing custom splitters, see the method getSplits in - TableInputFormatBase. - That is where the logic for map-task assignment resides. - + + +
+ Bundled HBase MapReduce Jobs + The HBase JAR also serves as a Driver for some bundled MapReduce jobs. To learn about + the bundled MapReduce jobs, run the following command. + 
 $ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar
An example program must be given as the first argument.
Valid program names are:
  copytable: Export a table from local cluster to peer cluster
  completebulkload: Complete a bulk data load.
  export: Write table data to HDFS.
  import: Import data written by Export.
  importtsv: Import data in TSV format.
  rowcounter: Count rows in HBase table
 + Each of the valid program names is a bundled MapReduce job. To run one of the jobs, + model your command after the following example. + $ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar rowcounter myTable
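+ For instance, to run the bundled export job against a hypothetical table named myTable, writing its contents to a hypothetical HDFS directory, the command might look like the following (adjust the JAR name to match your HBase version):
$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar export myTable /hbase-export/myTable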
-
-
- HBase MapReduce Examples -
- HBase MapReduce Read Example - The following is an example of using HBase as a MapReduce source in read-only manner. Specifically, - there is a Mapper instance but no Reducer, and nothing is being emitted from the Mapper. There job would be defined - as follows... - + +
+ HBase as a MapReduce Job Data Source and Data Sink + HBase can be used as a data source, TableInputFormat, + and data sink, TableOutputFormat + or MultiTableOutputFormat, + for MapReduce jobs. When writing MapReduce jobs that read or write HBase, it is advisable to + subclass TableMapper + and/or TableReducer. + See the do-nothing pass-through classes IdentityTableMapper + and IdentityTableReducer + for basic usage. For a more involved example, see RowCounter + or review the org.apache.hadoop.hbase.mapreduce.TestTableMapReduce unit test. + If you run MapReduce jobs that use HBase as source or sink, you need to specify the source and + sink table and column names in your configuration. + + When you read from HBase, the TableInputFormat requests the list of regions + from HBase and makes a map task per region, or + mapreduce.job.maps map tasks, whichever is smaller. If your job only has two maps, + raise mapreduce.job.maps to a number greater than the number of regions. Maps + will run on the adjacent TaskTracker if you are running a TaskTracker and RegionServer per + node. When writing to HBase, it may make sense to avoid the Reduce step and write back into + HBase from within your map. This approach works when your job does not need the sort and + collation that MapReduce does on the map-emitted data. On insert, HBase 'sorts' so there is + no point double-sorting (and shuffling data around your MapReduce cluster) unless you need + to. If you do not need the Reduce, your map might emit counts of records processed for + reporting at the end of the job, or you can set the number of Reduces to zero and use + TableOutputFormat; a sketch of this map-only pattern appears below. If running the Reduce step makes sense in your case, you should typically + use multiple reducers so that load is spread across the HBase cluster. + + A new HBase partitioner, the HRegionPartitioner, + can run as many reducers as there are existing regions. The HRegionPartitioner is suitable + when your table is large and your upload will not greatly alter the number of existing + regions upon completion. Otherwise use the default partitioner.
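+ The following is a minimal sketch of the map-only write-back pattern described above. It assumes a hypothetical source table sourceTable, a hypothetical target table targetTable, and a hypothetical MyWriteMapper that emits ImmutableBytesWritable row keys and Put instances; as with the other examples in this chapter, imports are omitted.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleMapOnlyWrite");  // hypothetical job name
job.setJarByClass(MyMapOnlyJob.class);             // hypothetical driver class

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs

TableMapReduceUtil.initTableMapperJob(
  "sourceTable",              // hypothetical input table
  scan,
  MyWriteMapper.class,        // hypothetical mapper that calls context.write(rowKey, put)
  ImmutableBytesWritable.class,
  Put.class,
  job);
// Wires TableOutputFormat against the target table; passing a null reducer class
// plus zero reduce tasks keeps the job map-only, so Puts go straight to HBase.
TableMapReduceUtil.initTableReducerJob("targetTable", null, job);
job.setNumReduceTasks(0);

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}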
+ +
+ Writing HFiles Directly During Bulk Import + If you are importing into a new table, you can bypass the HBase API and write your + content directly to the filesystem, formatted into HBase data files (HFiles). Your import + will run faster, perhaps an order of magnitude faster. For more on how this mechanism works, + see . +
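+ As an illustration only, the following sketch shows one way such a bulk-write job might be set up, assuming a pre-created target table named myTable, a hypothetical MyHFileMapper that emits ImmutableBytesWritable/Put pairs, and hypothetical HDFS paths; imports are omitted as in the other examples.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleHFileBulkWrite");  // hypothetical job name
job.setJarByClass(MyBulkWriteJob.class);             // hypothetical driver class

job.setMapperClass(MyHFileMapper.class);             // hypothetical mapper
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);

FileInputFormat.addInputPath(job, new Path("/input/rawdata"));    // hypothetical input
FileOutputFormat.setOutputPath(job, new Path("/output/hfiles"));  // hypothetical output

// Configures the partitioner, reducer and output format so the job writes HFiles
// aligned to the current region boundaries of the target table.
HTable table = new HTable(config, "myTable");
HFileOutputFormat.configureIncrementalLoad(job, table);

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}
+ Once the job completes, the generated HFiles can be moved into the table with the bundled completebulkload tool listed above.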
+ +
+ RowCounter Example + The included RowCounter + MapReduce job uses TableInputFormat and does a count of all rows in the specified + table. To run it, use the following command: + $ ./bin/hadoop jar hbase-X.X.X.jar + This will + invoke the HBase MapReduce Driver class. Select rowcounter from the choice of jobs + offered. This will print rowcounter usage advice to standard output. Specify the tablename, + column to count, and output + directory. If you have classpath errors, see .
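+ For example, a hypothetical invocation counting the column cf:attr1 in a hypothetical table myTable might look like the following; depending on your HBase version, the job may also expect an output directory argument, so check the printed usage first.
$ ./bin/hadoop jar hbase-X.X.X.jar rowcounter myTable cf:attr1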
+ +
+ Map-Task Splitting +
+ The Default HBase MapReduce Splitter + When TableInputFormat + is used to source an HBase table in a MapReduce job, its splitter will make a map task for + each region of the table. Thus, if there are 100 regions in the table, there will be 100 + map-tasks for the job - regardless of how many column families are selected in the + Scan. +
+
+ Custom Splitters + For those interested in implementing custom splitters, see the method + getSplits in TableInputFormatBase. + That is where the logic for map-task assignment resides. +
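+ As a sketch only, a custom splitter might extend TableInputFormat and post-process the region-based splits produced by TableInputFormatBase.getSplits. The class name and the filtering logic below are hypothetical, and imports are omitted as in the other examples.
public class MyFilteringTableInputFormat extends TableInputFormat {  // hypothetical class

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    // Start from the default one-split-per-region behaviour...
    List<InputSplit> regionSplits = super.getSplits(context);
    List<InputSplit> kept = new ArrayList<InputSplit>();
    for (InputSplit split : regionSplits) {
      // ...then keep, drop, or re-group splits as required (hypothetical filter).
      kept.add(split);
    }
    return kept;
  }
}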
+
+
+ HBase MapReduce Examples +
+ HBase MapReduce Read Example + The following is an example of using HBase as a MapReduce source in read-only manner. + Specifically, there is a Mapper instance but no Reducer, and nothing is being emitted from + the Mapper. There job would be defined as follows... + Configuration config = HBaseConfiguration.create(); Job job = new Job(config, "ExampleRead"); job.setJarByClass(MyReadJob.class); // class that contains mapper @@ -784,8 +937,9 @@ if (!b) { throw new IOException("error with job!"); } - ...and the mapper instance would extend TableMapper... - + ...and the mapper instance would extend TableMapper... + public static class MyMapper extends TableMapper<Text, Text> { public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException { @@ -793,13 +947,13 @@ public static class MyMapper extends TableMapper<Text, Text> { } } - -
-
- HBase MapReduce Read/Write Example - The following is an example of using HBase both as a source and as a sink with MapReduce. - This example will simply copy data from one table to another. - +
+
+ HBase MapReduce Read/Write Example + The following is an example of using HBase both as a source and as a sink with + MapReduce. This example will simply copy data from one table to another. + Configuration config = HBaseConfiguration.create(); Job job = new Job(config,"ExampleReadWrite"); job.setJarByClass(MyReadWriteJob.class); // class that contains mapper @@ -827,15 +981,18 @@ if (!b) { throw new IOException("error with job!"); } - An explanation is required of what TableMapReduceUtil is doing, especially with the reducer. - TableOutputFormat is being used - as the outputFormat class, and several parameters are being set on the config (e.g., TableOutputFormat.OUTPUT_TABLE), as - well as setting the reducer output key to ImmutableBytesWritable and reducer value to Writable. - These could be set by the programmer on the job and conf, but TableMapReduceUtil tries to make things easier. - The following is the example mapper, which will create a Put and matching the input Result - and emit it. Note: this is what the CopyTable utility does. - - + An explanation is required of what TableMapReduceUtil is doing, + especially with the reducer. TableOutputFormat + is being used as the outputFormat class, and several parameters are being set on the + config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key + to ImmutableBytesWritable and reducer value to + Writable. These could be set by the programmer on the job and + conf, but TableMapReduceUtil tries to make things easier. + The following is the example mapper, which will create a Put + and matching the input Result and emit it. Note: this is what the + CopyTable utility does. + public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put> { public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException { @@ -852,23 +1009,24 @@ public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put& } } - There isn't actually a reducer step, so TableOutputFormat takes care of sending the Put - to the target table. - - This is just an example, developers could choose not to use TableOutputFormat and connect to the - target table themselves. - -
-
- HBase MapReduce Read/Write Example With Multi-Table Output - TODO: example for MultiTableOutputFormat. - -
-
- HBase MapReduce Summary to HBase Example - The following example uses HBase as a MapReduce source and sink with a summarization step. This example will - count the number of distinct instances of a value in a table and write those summarized counts in another table. - + There isn't actually a reducer step, so TableOutputFormat takes + care of sending the Put to the target table. + This is just an example, developers could choose not to use + TableOutputFormat and connect to the target table themselves. + +
+
+ HBase MapReduce Read/Write Example With Multi-Table Output + TODO: example for MultiTableOutputFormat. +
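+ Until that example is written, the following hypothetical sketch shows the general shape of a multi-table write: the job uses MultiTableOutputFormat, and the mapper emits the destination table name (as an ImmutableBytesWritable) as the key and the Put as the value. The table and class names are assumptions for illustration only.
// Job setup fragment:
job.setOutputFormatClass(MultiTableOutputFormat.class);

// Mapper fragment writing the same Put to two hypothetical tables:
public static class MyMultiTableMapper extends TableMapper<ImmutableBytesWritable, Put> {
  private static final ImmutableBytesWritable TABLE_A =
      new ImmutableBytesWritable(Bytes.toBytes("tableA"));  // hypothetical table
  private static final ImmutableBytesWritable TABLE_B =
      new ImmutableBytesWritable(Bytes.toBytes("tableB"));  // hypothetical table

  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    Put put = new Put(row.get());
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), Bytes.toBytes("someValue"));
    // The output key names the table each Put is routed to.
    context.write(TABLE_A, put);
    context.write(TABLE_B, put);
  }
}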
+
+ HBase MapReduce Summary to HBase Example + The following example uses HBase as a MapReduce source and sink with a summarization + step. This example will count the number of distinct instances of a value in a table and + write those summarized counts in another table. + Configuration config = HBaseConfiguration.create(); Job job = new Job(config,"ExampleSummary"); job.setJarByClass(MySummaryJob.class); // class that contains mapper and reducer @@ -896,9 +1054,10 @@ if (!b) { throw new IOException("error with job!"); } - In this example mapper a column with a String-value is chosen as the value to summarize upon. - This value is used as the key to emit from the mapper, and an IntWritable represents an instance counter. - + In this example mapper a column with a String-value is chosen as the value to summarize + upon. This value is used as the key to emit from the mapper, and an + IntWritable represents an instance counter. + public static class MyMapper extends TableMapper<Text, IntWritable> { public static final byte[] CF = "cf".getBytes(); public static final byte[] ATTR1 = "attr1".getBytes(); @@ -914,8 +1073,9 @@ public static class MyMapper extends TableMapper<Text, IntWritable> { } } - In the reducer, the "ones" are counted (just like any other MR example that does this), and then emits a Put. - + In the reducer, the "ones" are counted (just like any other MR example that does this), + and then emits a Put. + public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> { public static final byte[] CF = "cf".getBytes(); public static final byte[] COUNT = "count".getBytes(); @@ -932,14 +1092,15 @@ public static class MyTableReducer extends TableReducer<Text, IntWritable, Im } } - -
-
- HBase MapReduce Summary to File Example - This very similar to the summary example above, with exception that this is using HBase as a MapReduce source - but HDFS as the sink. The differences are in the job setup and in the reducer. The mapper remains the same. - - + +
+
+ HBase MapReduce Summary to File Example + This very similar to the summary example above, with exception that this is using + HBase as a MapReduce source but HDFS as the sink. The differences are in the job setup and + in the reducer. The mapper remains the same. + Configuration config = HBaseConfiguration.create(); Job job = new Job(config,"ExampleSummaryToFile"); job.setJarByClass(MySummaryFileJob.class); // class that contains mapper and reducer @@ -965,9 +1126,10 @@ if (!b) { throw new IOException("error with job!"); } - As stated above, the previous Mapper can run unchanged with this example. - As for the Reducer, it is a "generic" Reducer instead of extending TableMapper and emitting Puts. - + As stated above, the previous Mapper can run unchanged with this example. As for the + Reducer, it is a "generic" Reducer instead of extending TableMapper and emitting + Puts. + public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { @@ -979,33 +1141,35 @@ if (!b) { } } -
-
- HBase MapReduce Summary to HBase Without Reducer - It is also possible to perform summaries without a reducer - if you use HBase as the reducer. - - An HBase target table would need to exist for the job summary. The HTable method incrementColumnValue - would be used to atomically increment values. From a performance perspective, it might make sense to keep a Map - of values with their values to be incremeneted for each map-task, and make one update per key at during the - cleanup method of the mapper. However, your milage may vary depending on the number of rows to be processed and - unique keys. - - In the end, the summary results are in HBase. - -
-
- HBase MapReduce Summary to RDBMS - Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases, it is possible - to generate summaries directly to an RDBMS via a custom reducer. The setup method - can connect to an RDBMS (the connection information can be passed via custom parameters in the context) and the - cleanup method can close the connection. - - It is critical to understand that number of reducers for the job affects the summarization implementation, and - you'll have to design this into your reducer. Specifically, whether it is designed to run as a singleton (one reducer) - or multiple reducers. Neither is right or wrong, it depends on your use-case. Recognize that the more reducers that - are assigned to the job, the more simultaneous connections to the RDBMS will be created - this will scale, but only to a point. - - +
+
+ HBase MapReduce Summary to HBase Without Reducer + It is also possible to perform summaries without a reducer - if you use HBase as the + reducer. + An HBase target table would need to exist for the job summary. The HTable method + incrementColumnValue would be used to atomically increment values. From a + performance perspective, it might make sense to keep a Map of values with their counts to + be incremented for each map-task, and make one update per key during the + cleanup method of the mapper. However, your mileage may vary depending on the + number of rows to be processed and unique keys. + In the end, the summary results are in HBase. A hypothetical sketch of this map-side pattern follows.
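+ The following hypothetical sketch illustrates that pattern; the summary table name, column names, and the MyCountingMapper class are assumptions for illustration, and imports are omitted as in the other examples.
public static class MyCountingMapper extends TableMapper<Text, Text> {  // hypothetical mapper

  private HTable summaryTable;
  private Map<String, Long> counts = new HashMap<String, Long>();

  public void setup(Context context) throws IOException {
    summaryTable = new HTable(context.getConfiguration(), "mySummaryTable");  // hypothetical table
  }

  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    String key = Bytes.toString(value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr1")));
    Long current = counts.get(key);
    counts.put(key, current == null ? 1L : current + 1);  // accumulate locally, emit nothing
  }

  public void cleanup(Context context) throws IOException {
    // One atomic increment per distinct key, instead of one write per input row.
    for (Map.Entry<String, Long> entry : counts.entrySet()) {
      summaryTable.incrementColumnValue(Bytes.toBytes(entry.getKey()),
          Bytes.toBytes("cf"), Bytes.toBytes("count"), entry.getValue());
    }
    summaryTable.close();
  }
}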
+
+ HBase MapReduce Summary to RDBMS + Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases, + it is possible to generate summaries directly to an RDBMS via a custom reducer. The + setup method can connect to an RDBMS (the connection information can be + passed via custom parameters in the context) and the cleanup method can close the + connection. + It is critical to understand that number of reducers for the job affects the + summarization implementation, and you'll have to design this into your reducer. + Specifically, whether it is designed to run as a singleton (one reducer) or multiple + reducers. Neither is right or wrong, it depends on your use-case. Recognize that the more + reducers that are assigned to the job, the more simultaneous connections to the RDBMS will + be created - this will scale, but only to a point. + public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private Connection c = null; @@ -1025,18 +1189,18 @@ if (!b) { } - In the end, the summary results are written to your RDBMS table/s. - -
+ In the end, the summary results are written to your RDBMS table/s. +
-
-
- Accessing Other HBase Tables in a MapReduce Job - Although the framework currently allows one HBase table as input to a - MapReduce job, other HBase tables can - be accessed as lookup tables, etc., in a - MapReduce job via creating an HTable instance in the setup method of the Mapper. - public class MyMapper extends TableMapper<Text, LongWritable> { +
+ +
+ Accessing Other HBase Tables in a MapReduce Job + Although the framework currently allows one HBase table as input to a MapReduce job, + other HBase tables can be accessed as lookup tables, etc., in a MapReduce job via creating + an HTable instance in the setup method of the Mapper. + public class MyMapper extends TableMapper<Text, LongWritable> { private HTable myOtherTable; public void setup(Context context) { @@ -1049,20 +1213,19 @@ if (!b) { } - + +
+
+ Speculative Execution + It is generally advisable to turn off speculative execution for MapReduce jobs that use + HBase as a source. This can either be done on a per-Job basis through properties, or on the + entire cluster. Especially for longer running jobs, speculative execution will create + duplicate map-tasks which will double-write your data to HBase; this is probably not what + you want. A per-job sketch is shown below. + See for more information.
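+ As an illustration (the property names depend on your Hadoop version), disabling speculative execution for a single job might look like the following; older Hadoop releases use mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution instead.
// Turn off speculative execution for this job only (MR2 property names).
Configuration config = HBaseConfiguration.create();
config.setBoolean("mapreduce.map.speculative", false);
config.setBoolean("mapreduce.reduce.speculative", false);
Job job = new Job(config, "ExampleNoSpeculation");  // hypothetical job name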
-
- Speculative Execution - It is generally advisable to turn off speculative execution for - MapReduce jobs that use HBase as a source. This can either be done on a - per-Job basis through properties, on on the entire cluster. Especially - for longer running jobs, speculative execution will create duplicate - map-tasks which will double-write your data to HBase; this is probably - not what you want. - - See for more information. - -