Edit and formatting

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1085491 13f79535-47bb-0310-9956-ffa450edef68
Michael Stack 2011-03-25 18:00:28 +00:00
parent 20fda55327
commit f55aa19018
5 changed files with 115 additions and 108 deletions


@ -43,7 +43,6 @@
<revhistory>
<revision>
<date />
<revdescription>Adding first cuts at Configuration, Getting Started, Data Model</revdescription>
<revnumber>
<?eval ${project.version}?>
@ -74,42 +73,40 @@
<chapter xml:id="mapreduce">
<title>HBase and MapReduce</title>
<para>See <link xlink:href="http://hbase.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description">HBase and MapReduce</link> up in javadocs. Start there. Below are is some additional
help.</para>
<para>See <link xlink:href="http://hbase.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description">HBase and MapReduce</link> up in javadocs.
Start there. Below is some additional help.</para>
<section xml:id="splitter">
<title>The default HBase MapReduce Splitter</title>
<para>When an HBase table is used as a MapReduce source,
a map task will be created for each region in the table.
<para>When <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>
is used to source an HBase table in a MapReduce job,
its splitter will make a map task for each region of the table.
Thus, if there are 100 regions in the table, there will be
100 map-tasks for the job - regardless of how many column families are selected in the Scan.</para>
</section>
<section xml:id="mapreduce.example">
<title>HBase Input MapReduce Example</title>
<para>To use HBase as a MapReduce source, the job would be configured via <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.html">TableMapReduceUtil</link> in the following manner...
<programlisting>
Job job = ...;
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);
// set other scan attrs
<para>To use HBase as a MapReduce source,
the job would be configured via <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.html">TableMapReduceUtil</link> in the following manner...
<programlisting>Job job = ...;
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);
// Now set other scan attrs
...
TableMapReduceUtil.initTableMapperJob(
tableName, // input HBase table name
scan, // Scan instance to control CF and attribute selection
MyMapper.class, // mapper
Text.class, // reducer key
LongWritable.class, // reducer value
job // job instance
);
</programlisting>
TableMapReduceUtil.initTableMapperJob(
tableName, // input HBase table name
scan, // Scan instance to control CF and attribute selection
MyMapper.class, // mapper
Text.class, // mapper output key
LongWritable.class, // mapper output value
job // job instance
);</programlisting>
...and the mapper instance would extend <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>...
<programlisting>
public class MyMapper extends TableMapper&lt;Text, LongWritable&gt; {
public void map(ImmutableBytesWritable row, Result value, Context context)
throws InterruptedException, IOException {
// process data for the row from the Result instance.
</programlisting>
<programlisting>public class MyMapper extends TableMapper&lt;Text, LongWritable&gt; {
public void map(ImmutableBytesWritable row, Result value, Context context)
throws InterruptedException, IOException {
// process data for the row from the Result instance.</programlisting>
</para>
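<para>For reference, a complete (hypothetical) mapper along the lines of the sketch above might look as follows; the column family <code>cf</code> and qualifier <code>attr1</code> are invented names used only for illustration, not anything required by TableMapper.
<programlisting>public class MyMapper extends TableMapper&lt;Text, LongWritable&gt; {

  // Invented column, used only for this illustration.
  private static final byte[] CF = Bytes.toBytes("cf");
  private static final byte[] ATTR = Bytes.toBytes("attr1");

  private final Text text = new Text();
  private static final LongWritable ONE = new LongWritable(1);

  @Override
  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    byte[] cell = value.getValue(CF, ATTR);
    if (cell != null) {
      text.set(Bytes.toString(cell));
      context.write(text, ONE);   // emit (value, 1) for a downstream count
    }
  }
}</programlisting>
The <classname>Text</classname>/<classname>LongWritable</classname> pair written here matches the output classes passed to <code>initTableMapperJob</code> above.
</para>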
</section>
<section xml:id="mapreduce.htable.access">
@ -118,21 +115,24 @@
MapReduce job, other HBase tables can
be accessed as lookup tables, etc., in a
MapReduce job by creating an HTable instance in the setup method of the Mapper.
<programlisting>
public class MyMapper extends TableMapper&lt;Text, LongWritable&gt; {
private HTable myOtherTable;
<programlisting>public class MyMapper extends TableMapper&lt;Text, LongWritable&gt; {
private HTable myOtherTable;
@Override
public void setup(Context context) {
myOtherTable = new HTable("myOtherTable");
}
</programlisting>
@Override
public void setup(Context context) {
myOtherTable = new HTable("myOtherTable");
}</programlisting>
</para>
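<para>A minimal sketch of how such a lookup table might then be used and released, assuming a made-up column <code>cf:fk</code> in the source table holds the key to fetch from the second table:
<programlisting>@Override
public void map(ImmutableBytesWritable row, Result value, Context context)
    throws IOException, InterruptedException {
  // Hypothetical lookup: the (invented) column cf:fk holds the row key
  // to fetch from the second table.
  byte[] lookupKey = value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("fk"));
  if (lookupKey != null) {
    Result other = myOtherTable.get(new Get(lookupKey));
    // ... combine 'value' and 'other' as the job requires ...
  }
}

@Override
public void cleanup(Context context) throws IOException {
  myOtherTable.close();   // release the lookup table when the task finishes
}</programlisting>
</para>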
</section>
</chapter>
<chapter xml:id="schema">
<title>HBase and Schema Design</title>
<para>A good general introduction to the strengths and weaknesses of modelling on
the various non-rdbms datastores is Ian Varley's Master's thesis,
<link xlink:href="http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf">No Relation: The Mixed Blessings of Non-Relational Databases</link>.
Recommended.
</para>
<section xml:id="schema.creation">
<title>
Schema Creation
@ -142,10 +142,6 @@
</para>
</section>
<section xml:id="number.of.cfs">
<para>A good general introduction on the strength and weaknesses modelling on
the various non-rdbms datastores is Ian Varleys' Master thesis,
<link xlink:href="http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf">No Relation: The Mixed Blessings of Non-Relational Databases</link>.
</para>
<title>
On the number of column families
</title>
@ -177,7 +173,7 @@
<para>If you do need to upload time series data into HBase, you should
study <link xlink:href="http://opentsdb.net/">OpenTSDB</link> as a
successful example. It has a page describing the <link xlink:href=" http://opentsdb.net/schema.html">schema</link> it uses in
HBase. The key format in OpenTSDB is effectively [metric_type][event_timestamp], which would appear at first glance to contradict the previous advice about not using a timestamp as the key. However, the difference is that the timestamp is not in the <b>lead</b> position of the key, and the design assumption is that there are dozens or hundreds (or more) of different metric types. Thus, even with a continual stream of input data with a mix of metric types, the Puts are distributed across various points of regions in the table.
HBase. The key format in OpenTSDB is effectively [metric_type][event_timestamp], which would appear at first glance to contradict the previous advice about not using a timestamp as the key. However, the difference is that the timestamp is not in the <emphasis>lead</emphasis> position of the key, and the design assumption is that there are dozens or hundreds (or more) of different metric types. Thus, even with a continual stream of input data with a mix of metric types, the Puts are distributed across different regions of the table.
</para>
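<para>A rough sketch of composing such a key with the HBase <classname>Bytes</classname> utility (the metric id and timestamp values below are invented for illustration):
<programlisting>int metricId = 42;                                  // hypothetical metric identifier
long eventTimestamp = System.currentTimeMillis();
// Metric id leads, timestamp follows, so writes for different metrics
// land at different points in the keyspace.
byte[] rowkey = Bytes.add(Bytes.toBytes(metricId), Bytes.toBytes(eventTimestamp));
Put put = new Put(rowkey);</programlisting>
</para>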
</section>
<section xml:id="keysize">
@ -207,35 +203,32 @@
Tables in HBase are initially created with one region by default. For bulk imports, this means that all clients will write to the same region until it is large enough to split and become distributed across the cluster. A useful pattern to speed up the bulk import process is to pre-create empty regions. Be somewhat conservative in this, because too many regions can actually degrade performance. An example of pre-creation using hex keys is as follows (note: this example may need to be tweaked to the individual application's keys):
</para>
<para>
<programlisting>
public static boolean createTable(HBaseAdmin admin, HTableDescriptor table, byte[][] splits)
throws IOException {
try {
admin.createTable( table, splits );
return true;
} catch (TableExistsException e) {
logger.info("table " + table.getNameAsString() + " already exists");
// the table already exists...
return false;
}
}
public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
byte[][] splits = new byte[numRegions-1][];
BigInteger lowestKey = new BigInteger(startKey, 16);
BigInteger highestKey = new BigInteger(endKey, 16);
BigInteger range = highestKey.subtract(lowestKey);
BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
lowestKey = lowestKey.add(regionIncrement);
for(int i=0; i &lt; numRegions-1;i++) {
BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
byte[] b = String.format("%016x", key).getBytes();
splits[i] = b;
}
<programlisting>public static boolean createTable(HBaseAdmin admin, HTableDescriptor table, byte[][] splits)
throws IOException {
try {
admin.createTable( table, splits );
return true;
} catch (TableExistsException e) {
logger.info("table " + table.getNameAsString() + " already exists");
// the table already exists...
return false;
}
}
return splits;
}
</programlisting>
public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
byte[][] splits = new byte[numRegions-1][];
BigInteger lowestKey = new BigInteger(startKey, 16);
BigInteger highestKey = new BigInteger(endKey, 16);
BigInteger range = highestKey.subtract(lowestKey);
BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
lowestKey = lowestKey.add(regionIncrement);
for(int i=0; i &lt; numRegions-1;i++) {
BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
byte[] b = String.format("%016x", key).getBytes();
splits[i] = b;
}
return splits;
}</programlisting>
</para>
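<para>Putting the two methods together, a hypothetical caller might look as follows (table name, column family, key range and region count are assumptions; exception handling is omitted):
<programlisting>HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor table = new HTableDescriptor("myTable");
table.addFamily(new HColumnDescriptor("cf"));
// Pre-create 10 regions spanning the hex keys "00000000" through "ffffffff";
// adjust the range to match your own keys.
createTable(admin, table, getHexSplits("00000000", "ffffffff", 10));</programlisting>
</para>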
</section>
@ -422,7 +415,7 @@ Tables in HBase are initially created with one region by default. For bulk impo
basically a synopsis of this article by Bruno Dumon.</para>
</footnote>.</para>
<section xml:id="versions">
<section xml:id="versions.ops">
<title>Versions and HBase Operations</title>
<para>In this section we look at the behavior of the version dimension
@ -674,14 +667,17 @@ Tables in HBase are initially created with one region by default. For bulk impo
participate. The RegionServer splits a region, offlines the split
region and then adds the daughter regions to META, opens daughters on
the parent's hosting RegionServer and then reports the split to the
Master.</para>
Master. See <link linkend="disable.splitting">Managed Splitting</link> for how to manually manage
splits (and why you might do this).</para>
</section>
<section>
<title>Region Load Balancer</title>
<para>
Periodically, and when there are not any regions in transition, a load balancer will run and move regions around to balance cluster load.
Periodically, and when there are not any regions in transition,
a load balancer will run and move regions around to balance cluster load.
The period at which it runs can be configured.
</para>
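<para>For example, assuming your version exposes the balancer interval as the <varname>hbase.balancer.period</varname> property (milliseconds), the following <filename>hbase-site.xml</filename> entry would run the balancer every five minutes:
<programlisting><![CDATA[
<property>
  <name>hbase.balancer.period</name>
  <value>300000</value>
</property>
]]></programlisting>
</para>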
</section>
@ -1071,7 +1067,7 @@ When I build, why do I always get <code>Unable to find resource 'VM_global_libra
</answer>
</qandaentry>
</qandadiv>
<qandadiv><title>Runtime</title>
<qandadiv><title>Runtime</title>
<qandaentry>
<question><para>
Loading, why do I see pauses when loading HBase?
@ -1098,6 +1094,19 @@ When I build, why do I always get <code>Unable to find resource 'VM_global_libra
</answer>
</qandaentry>
</qandadiv>
<qandadiv><title>How do I...?</title>
<qandaentry xml:id="secondary.indices">
<question><para>
Secondary Indexes in HBase?
</para></question>
<answer>
<para>
For a useful introduction to the issues involved in maintaining a secondary index in a store like HBase,
see the David Butler message in this thread, <link xlink:href="http://search-hadoop.com/m/nvbiBp2TDP/Stargate%252Bhbase&amp;subj=Stargate+hbase">HBase, mail # user - Stargate+hbase</link>.
</para>
</answer>
</qandaentry>
</qandadiv>
</qandaset>
</appendix>


@ -9,8 +9,9 @@
xmlns:db="http://docbook.org/ns/docbook">
<title>Building HBase</title>
<section xml:id="mvn_repo">
<title>Adding an HBase release to Apache's Maven Repository</title>
<para>Follow the instructions at <link xlink:href="http://www.apache.org/dev/publishing-maven-artifacts.html">Publishing Maven Artifacts</link>.
The 'trick' to makiing it all work is answering the questions put to you by the mvn release plugin properly,
The 'trick' to making it all work is answering the questions put to you by the mvn release plugin properly,
making sure it is using the actual branch,
and finally, before doing the mvn release:perform,
VERY IMPORTANT, hand edit the release.properties file that was put under HBASE_HOME by the previous step, release:prepare. You need to edit it to make it point at the right locations in SVN.


@ -19,7 +19,7 @@
<footnote>
<para>
Be careful editing XML. Make sure you close all elements.
Run your file through <command>xmmlint</command> or similar
Run your file through <command>xmllint</command> or similar
to ensure well-formedness of your document after an edit session.
</para>
</footnote>
@ -53,7 +53,7 @@ to ensure well-formedness of your document after an edit session.
via a reading of the source code itself.
</para>
<para>
Changes here will require a cluster restart for HBase to notice the change.
Currently, changes here will require a cluster restart for HBase to notice the change.
</para>
<!--The file hbase-default.xml is generated as part of
the build of the hbase site. See the hbase pom.xml.
@ -68,12 +68,12 @@ to ensure well-formedness of your document after an edit session.
<para>Set HBase environment variables in this file.
Examples include options to pass the JVM on start of
an HBase daemon such as heap size and garbage collector configs.
You also set configurations for HBase configuration, log directories,
You can also set other configurations here such as log directories,
niceness, ssh options, where to locate process pid files,
etc., via settings in this file. Open the file at
etc. Open the file at
<filename>conf/hbase-env.sh</filename> and peruse its content.
Each option is fairly well documented. Add your own environment
variables here if you want them read by HBase daemon startup.</para>
variables here if you want them read by HBase daemons on startup.</para>
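<para>For example (the values below are illustrative only), to give the daemons a larger heap and add a garbage-collector flag you might add:
<programlisting># Heap size, in megabytes, given to each HBase daemon
export HBASE_HEAPSIZE=4096
# Extra JVM options passed to all HBase daemons
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC"</programlisting>
</para>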
<para>
Changes here will require a cluster restart for HBase to notice the change.
</para>
@ -92,7 +92,8 @@ to ensure well-formedness of your document after an edit session.
<section xml:id="important_configurations">
<title>The Important Configurations</title>
<para>Below we list the important Configurations. We've divided this section into
<para>Below we list the <emphasis>important</emphasis>
Configurations. We've divided this section into
required configuration and worth-a-look recommended configs.
</para>
@ -292,26 +293,25 @@ of all regions.
]]></programlisting>
</para>
<section>
<title>Java client configuration connecting to an HBase cluster</title>
<title>Java client configuration</title>
<subtitle>How Java reads <filename>hbase-site.xml</filename> content</subtitle>
<para>The configuration used by a client is kept
<para>The configuration used by a Java client is kept
in an <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HBaseConfiguration">HBaseConfiguration</link> instance.
Through the factory method on HBaseConfiguration...
<programlisting>Configuration config = HBaseConfiguration.create();</programlisting>
.. a client will pick up a configuration sourced from the first <filename>hbase-site.xml</filename> found on
The factory method on HBaseConfiguration, <code>HBaseConfiguration.create();</code>,
on invocation, will read in the content of the first <filename>hbase-site.xml</filename> found on
the client's <varname>CLASSPATH</varname>, if one is present
(We'll also factor in any <filename>hbase-default.xml</filename> found; an hbase-default.xml ships inside
the <filename>hbase.X.X.X.jar</filename>).
(Invocation will also factor in any <filename>hbase-default.xml</filename> found;
an hbase-default.xml ships inside the <filename>hbase.X.X.X.jar</filename>).
It is also possible to specify configuration directly without having to read from a
<filename>hbase-site.xml</filename>. For examplt to set the
<link linkend="zookeeper">zookeeper</link> ensemble for the cluster programmatically do as follows...
<filename>hbase-site.xml</filename>. For example, to set the
<link linkend="zookeeper">zookeeper</link> ensemble for the cluster programmatically do as follows:
<programlisting>Configuration config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", "localhost"); // we are running zookeeper locally
</programlisting>
config.set("hbase.zookeeper.quorum", "localhost"); // Here we are running zookeeper locally</programlisting>
If multiple <link linkend="zookeeper">zookeeper</link> instances make up your zookeeper ensemble,
they may be specified in a comma-list (just as in the <filename>hbase-site.xml</filename> file).
This populated Configuration instance can then be passed to an
<link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html">HTable</link> instance via the overloaded constructor.
they may be specified in a comma-separated list (just as in the <filename>hbase-site.xml</filename> file).
This populated <classname>Configuration</classname> instance can then be passed to an
<link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html">HTable</link>,
and so on.
</para>
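<para>For instance, a fully configured client might look like the following (the quorum hosts and table name are made-up):
<programlisting>Configuration config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
HTable table = new HTable(config, "myTable");   // uses the supplied configuration</programlisting>
</para>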
</section>
</section>


@ -104,17 +104,15 @@
<title>AutoFlush</title>
<para>When performing a lot of Puts, make sure that setAutoFlush is set
to false on <link
to false on your <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html">HTable</link>
instance. Otherwise, the Puts will be sent one at a time to the
regionserver. Puts added via... <programlisting>
htable.add(Put);
</programlisting> ... and ... <programlisting>
htable.add( &lt;List&gt; Put);
</programlisting> ... wind up in the same write buffer. If autoFlush=false,
regionserver. Puts added via <code>htable.put(Put)</code> and <code>htable.put(List&lt;Put&gt;)</code>
wind up in the same write buffer. If <code>autoFlush = false</code>,
these messages are not sent until the write-buffer is filled. To
explicitly flush the messages, call .flushCommits(). Calling .close() on
the htable instance will invoke flushCommits().</para>
explicitly flush the messages, call <methodname>flushCommits</methodname>.
Calling <methodname>close</methodname> on the <classname>HTable</classname>
instance will invoke <methodname>flushCommits</methodname>.</para>
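<para>A minimal sketch of the pattern described above (the table name and the source of the Puts are placeholders):
<programlisting>HTable htable = new HTable(config, "myTable");
htable.setAutoFlush(false);          // buffer Puts on the client side
for (Put put : puts) {               // 'puts' assembled elsewhere
  htable.put(put);                   // lands in the write buffer
}
htable.flushCommits();               // send whatever is still buffered
htable.close();                      // also flushes, then releases resources</programlisting>
</para>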
</section>
<section xml:id="perf.hbase.client.caching">
@ -123,7 +121,7 @@ htable.add( &lt;List&gt; Put);
<para>If HBase is used as an input source for a MapReduce job, for
example, make sure that the input <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</link>
instance to the MapReduce job has setCaching set to something greater
instance to the MapReduce job has <methodname>setCaching</methodname> set to something greater
than the default (which is 1). Using the default value means that the
map-task will make a call back to the region-server for every record
processed. Setting this value to 500, for example, will transfer 500
@ -150,8 +148,7 @@ try {
} finally {
rs.close(); // always close the ResultScanner!
}
htable.close();
</programlisting></para>
htable.close();</programlisting></para>
</section>
<section xml:id="perf.hbase.client.blockcache">
@ -160,8 +157,8 @@ htable.close();
<para><link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</link>
instances can be set to use the block cache in the region server via the
setCacheBlocks method. For input Scans to MapReduce jobs, this should be
false. For frequently access rows, it is advisable to use the block
<methodname>setCacheBlocks</methodname> method. For input Scans to MapReduce jobs, this should be
<varname>false</varname>. For frequently accessed rows, it is advisable to use the block
cache.</para>
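<para>For example, an input Scan for a MapReduce job might be configured as follows (the caching value is just an illustration):
<programlisting>Scan scan = new Scan();
scan.setCacheBlocks(false);  // do not churn the region server block cache
scan.setCaching(500);        // see the scan caching discussion above</programlisting>
</para>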
</section>
</section>


@ -11,7 +11,7 @@
<para>
The HBase Shell is <link xlink:href="http://jruby.org">(J)Ruby</link>'s
IRB with some HBase particular verbs added. Anything you can do in
IRB with some HBase-particular commands added. Anything you can do in
IRB, you should be able to do in the HBase Shell.</para>
<para>To run the HBase shell,
do as follows: