diff --git a/src/docbkx/book.xml b/src/docbkx/book.xml index 8f1edeb532b..995e488c578 100644 --- a/src/docbkx/book.xml +++ b/src/docbkx/book.xml @@ -43,7 +43,6 @@ - Adding first cuts at Configuration, Getting Started, Data Model @@ -74,42 +73,40 @@ HBase and MapReduce - See HBase and MapReduce up in javadocs. Start there. Below are is some additional - help. + See HBase and MapReduce in the javadocs. + Start there. Below is some additional help.
The default HBase MapReduce Splitter - When an HBase table is used as a MapReduce source, - a map task will be created for each region in the table. + When TableInputFormat + is used to source an HBase table in a MapReduce job, + its splitter will make a map task for each region of the table. Thus, if there are 100 regions in the table, there will be 100 map-tasks for the job - regardless of how many column families are selected in the Scan.
HBase Input MapReduce Example - To use HBase as a MapReduce source, the job would be configured via TableMapReduceUtil in the following manner... - - Job job = ...; - Scan scan = new Scan(); - scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs - scan.setCacheBlocks(false); - // set other scan attrs + To use HBase as a MapReduce source, + the job would be configured via TableMapReduceUtil in the following manner... + Job job = ...; +Scan scan = new Scan(); +scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs +scan.setCacheBlocks(false); +// Now set other scan attrs +... - TableMapReduceUtil.initTableMapperJob( - tableName, // input HBase table name - scan, // Scan instance to control CF and attribute selection - MyMapper.class, // mapper - Text.class, // reducer key - LongWritable.class, // reducer value - job // job instance - ); - +TableMapReduceUtil.initTableMapperJob( + tableName, // input HBase table name + scan, // Scan instance to control CF and attribute selection + MyMapper.class, // mapper + Text.class, // mapper output key + LongWritable.class, // mapper output value + job // job instance + ); ...and the mapper instance would extend TableMapper... - - public class MyMapper extends TableMapper<Text, LongWritable> { - public void map(ImmutableBytesWritable row, Result value, Context context) - throws InterruptedException, IOException { - - // process data for the row from the Result instance. - + public class MyMapper extends TableMapper<Text, LongWritable> { +public void map(ImmutableBytesWritable row, Result value, Context context) +throws InterruptedException, IOException { +// process data for the row from the Result instance.
@@ -118,21 +115,24 @@ MapReduce job, other HBase tables can be accessed as lookup tables, etc., in a MapReduce job by creating an HTable instance in the setup method of the Mapper. - - public class MyMapper extends TableMapper<Text, LongWritable> { - private HTable myOtherTable; + public class MyMapper extends TableMapper<Text, LongWritable> { + private HTable myOtherTable; - @Override - public void setup(Context context) { - myOtherTable = new HTable("myOtherTable"); - } - + @Override + public void setup(Context context) { + myOtherTable = new HTable("myOtherTable"); + }
HBase and Schema Design + A good general introduction to the strengths and weaknesses of modelling in + the various non-rdbms datastores is Ian Varley's Master's thesis, + No Relation: The Mixed Blessings of Non-Relational Databases. + Recommended. +
Schema Creation @@ -142,10 +142,6 @@ </para> </section> <section xml:id="number.of.cfs"> - <para>A good general introduction on the strength and weaknesses modelling on - the various non-rdbms datastores is Ian Varleys' Master thesis, - <link xlink:href="http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf">No Relation: The Mixed Blessings of Non-Relational Databases</link>. - </para> <title> On the number of column families @@ -177,7 +173,7 @@ If you do need to upload time series data into HBase, you should study OpenTSDB as a successful example. It has a page describing the schema it uses in - HBase. The key format in OpenTSDB is effectively [metric_type][event_timestamp], which would appear at first glance to contradict the previous advice about not using a timestamp as the key. However, the difference is that the timestamp is not in the lead position of the key, and the design assumption is that there are dozens or hundreds (or more) of different metric types. Thus, even with a continual stream of input data with a mix of metric types, the Puts are distributed across various points of regions in the table. + HBase. The key format in OpenTSDB is effectively [metric_type][event_timestamp], which would appear at first glance to contradict the previous advice about not using a timestamp as the key. However, the difference is that the timestamp is not in the lead position of the key, and the design assumption is that there are dozens or hundreds (or more) of different metric types. Thus, even with a continual stream of input data with a mix of metric types, the Puts are distributed across various points of regions in the table.
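The [metric_type][event_timestamp] layout described above can be sketched in plain Java. This is a hypothetical illustration, not OpenTSDB's actual key-encoding code; the 2-byte metric id width and all class and method names are assumptions:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a [metric_type][event_timestamp] row key.
// Leading with the metric id spreads a mixed stream of metrics across
// regions; the big-endian timestamp keeps one metric's rows in time order.
public class MetricKeyDemo {
    public static byte[] makeKey(short metricId, long timestampMillis) {
        // 2-byte metric id followed by an 8-byte big-endian timestamp.
        return ByteBuffer.allocate(10)
                .putShort(metricId)
                .putLong(timestampMillis)
                .array();
    }

    public static void main(String[] args) {
        byte[] key = makeKey((short) 7, 1300000000000L);
        System.out.println("key length: " + key.length); // prints "key length: 10"
    }
}
```

Because the timestamp is not in the lead position, rows for different metric types interleave across the keyspace even when all writes arrive in time order.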
@@ -207,35 +203,32 @@ Tables in HBase are initially created with one region by default. For bulk imports, this means that all clients will write to the same region until it is large enough to split and become distributed across the cluster. A useful pattern to speed up the bulk import process is to pre-create empty regions. Be somewhat conservative in this, because too many regions can actually degrade performance. An example of pre-creation using hex-keys is as follows (note: this example may need to be tweaked to the individual application's keys): - - public static boolean createTable(HBaseAdmin admin, HTableDescriptor table, byte[][] splits) - throws IOException { - try { - admin.createTable( table, splits ); - return true; - } catch (TableExistsException e) { - logger.info("table " + table.getNameAsString() + " already exists"); - // the table already exists... - return false; - } - } - public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) { - byte[][] splits = new byte[numRegions-1][]; - BigInteger lowestKey = new BigInteger(startKey, 16); - BigInteger highestKey = new BigInteger(endKey, 16); - BigInteger range = highestKey.subtract(lowestKey); - - BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions)); - lowestKey = lowestKey.add(regionIncrement); - for(int i=0; i < numRegions-1;i++) { - BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i))); - byte[] b = String.format("%016x", key).getBytes(); - splits[i] = b; - } +public static boolean createTable(HBaseAdmin admin, HTableDescriptor table, byte[][] splits) +throws IOException { + try { + admin.createTable( table, splits ); + return true; + } catch (TableExistsException e) { + logger.info("table " + table.getNameAsString() + " already exists"); + // the table already exists... 
+ return false; + } +} - return splits; - } - +public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) { + byte[][] splits = new byte[numRegions-1][]; + BigInteger lowestKey = new BigInteger(startKey, 16); + BigInteger highestKey = new BigInteger(endKey, 16); + BigInteger range = highestKey.subtract(lowestKey); + BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions)); + lowestKey = lowestKey.add(regionIncrement); + for(int i=0; i < numRegions-1;i++) { + BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i))); + byte[] b = String.format("%016x", key).getBytes(); + splits[i] = b; + } + return splits; +}
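The split computation in getHexSplits uses only java.math.BigInteger, so it can be rehearsed standalone to see which boundary keys it produces. This sketch repeats that arithmetic outside HBase; the class name and example key range are illustrative:

```java
import java.math.BigInteger;

// Standalone rehearsal of the hex-split arithmetic from getHexSplits above;
// no HBase classes are needed to inspect the generated boundary keys.
public class HexSplitDemo {
    public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
        byte[][] splits = new byte[numRegions - 1][];
        BigInteger lowestKey = new BigInteger(startKey, 16);
        BigInteger highestKey = new BigInteger(endKey, 16);
        BigInteger range = highestKey.subtract(lowestKey);
        BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
        lowestKey = lowestKey.add(regionIncrement);
        for (int i = 0; i < numRegions - 1; i++) {
            // Each split key is zero-padded to 16 hex characters.
            BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
            splits[i] = String.format("%016x", key).getBytes();
        }
        return splits;
    }

    public static void main(String[] args) {
        // Four regions over the hex range ["0", "100") give three boundary keys.
        for (byte[] split : getHexSplits("0", "100", 4)) {
            System.out.println(new String(split));
        }
    }
}
```

Run as above, this prints 0000000000000040, 0000000000000080, and 00000000000000c0, i.e. n regions need only n-1 split keys.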
@@ -422,7 +415,7 @@ Tables in HBase are initially created with one region by default. For bulk impo basically a synopsis of this article by Bruno Dumon. . -
+
Versions and HBase Operations In this section we look at the behavior of the version dimension @@ -674,14 +667,17 @@ Tables in HBase are initially created with one region by default. For bulk impo participate. The RegionServer splits a region, offlines the split region and then adds the daughter regions to META, opens daughters on the parent's hosting RegionServer and then reports the split to the - Master. + Master. See Managed Splitting for how to manually manage + splits (and for why you might do this).
Region Load Balancer - Periodically, and when there are not any regions in transition, a load balancer will run and move regions around to balance cluster load. + Periodically, and when there are no regions in transition, + a load balancer will run and move regions around to balance cluster load. + The period at which it runs can be configured.
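As a sketch of configuring that period: in HBase the balancer interval is controlled by the hbase.balancer.period property (milliseconds) in hbase-site.xml; check your version's hbase-default.xml for the exact name and default before relying on it.

```xml
<!-- hbase-site.xml: run the balancer every minute instead of the default.
     Property name taken from hbase-default.xml; verify for your version. -->
<property>
  <name>hbase.balancer.period</name>
  <value>60000</value>
</property>
```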
@@ -1071,7 +1067,7 @@ When I build, why do I always get Unable to find resource 'VM_global_libra - Runtime + Runtime Loading, why do I see pauses when loading HBase? @@ -1098,6 +1094,19 @@ When I build, why do I always get Unable to find resource 'VM_global_libra + How do I...? + + + Secondary Indexes in HBase? + + + + For a useful introduction to the issues involved in maintaining a secondary index in a store like HBase, + see the David Butler message in this thread, HBase, mail # user - Stargate+hbase + + + + diff --git a/src/docbkx/build.xml b/src/docbkx/build.xml index 46bafe96107..0bae0d073a2 100644 --- a/src/docbkx/build.xml +++ b/src/docbkx/build.xml @@ -9,8 +9,9 @@ xmlns:db="http://docbook.org/ns/docbook"> Building HBase
+ Adding an HBase release to Apache's Maven Repository Follow the instructions at Publishing Maven Artifacts. - The 'trick' to makiing it all work is answering the questions put to you by the mvn release plugin properly, + The 'trick' to making it all work is answering the questions put to you by the mvn release plugin properly, making sure it is using the actual branch, and finally, before doing the mvn release:perform, VERY IMPORTANT, hand edit the release.properties file that was put under HBASE_HOME by the previous step, release:perform. You need to edit it to make it point at right locations in SVN. diff --git a/src/docbkx/configuration.xml b/src/docbkx/configuration.xml index d28e8d30aa3..bdcc0bcf4aa 100644 --- a/src/docbkx/configuration.xml +++ b/src/docbkx/configuration.xml @@ -19,7 +19,7 @@ Be careful editing XML. Make sure you close all elements. -Run your file through xmmlint or similar +Run your file through xmllint or similar to ensure well-formedness of your document after an edit session. @@ -53,7 +53,7 @@ to ensure well-formedness of your document after an edit session. via a reading of the source code itself. - Changes here will require a cluster restart for HBase to notice the change. + Currently, changes here will require a cluster restart for HBase to notice the change.
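As a minimal illustration of the xmllint well-formedness check suggested above (the file path and sample content are hypothetical; point xmllint at your real config files):

```shell
# Write a tiny sample config, then check that it is well-formed XML.
# --noout prints nothing on success; a parse error gives a non-zero exit code.
cat > /tmp/sample-hbase-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>
EOF
xmllint --noout /tmp/sample-hbase-site.xml && echo "well-formed"
```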