Edit and formatting
git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1085491 13f79535-47bb-0310-9956-ffa450edef68
parent 20fda55327
commit f55aa19018
@@ -43,7 +43,6 @@
<revhistory>
<revision>
<date />

<revdescription>Adding first cuts at Configuration, Getting Started, Data Model</revdescription>
<revnumber>
<?eval ${project.version}?>

@@ -74,42 +73,40 @@

<chapter xml:id="mapreduce">
<title>HBase and MapReduce</title>
<para>See <link xlink:href="http://hbase.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description">HBase and MapReduce</link> up in javadocs. Start there. Below are is some additional
help.</para>
<para>See <link xlink:href="http://hbase.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description">HBase and MapReduce</link> up in javadocs.
Start there. Below is some additional help.</para>
<section xml:id="splitter">
<title>The default HBase MapReduce Splitter</title>
<para>When an HBase table is used as a MapReduce source,
a map task will be created for each region in the table.
<para>When <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>,
is used to source an HBase table in a MapReduce job,
its splitter will make a map task for each region of the table.
Thus, if there are 100 regions in the table, there will be
100 map-tasks for the job - regardless of how many column families are selected in the Scan.</para>
</section>
<section xml:id="mapreduce.example">
<title>HBase Input MapReduce Example</title>
<para>To use HBase as a MapReduce source, the job would be configured via <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.html">TableMapReduceUtil</link> in the following manner...
<programlisting>
Job job = ...;
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);
// set other scan attrs
<para>To use HBase as a MapReduce source,
the job would be configured via <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.html">TableMapReduceUtil</link> in the following manner...
<programlisting>Job job = ...;
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);
// Now set other scan attrs
...

TableMapReduceUtil.initTableMapperJob(
TableMapReduceUtil.initTableMapperJob(
tableName, // input HBase table name
scan, // Scan instance to control CF and attribute selection
MyMapper.class, // mapper
Text.class, // reducer key
LongWritable.class, // reducer value
job // job instance
);
</programlisting>
);</programlisting>
...and the mapper instance would extend <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>...
<programlisting>
public class MyMapper extends TableMapper<Text, LongWritable> {
public void map(ImmutableBytesWritable row, Result value, Context context)
throws InterruptedException, IOException {

// process data for the row from the Result instance.
</programlisting>
<programlisting>public class MyMapper extends TableMapper<Text, LongWritable> {
public void map(ImmutableBytesWritable row, Result value, Context context)
throws InterruptedException, IOException {
// process data for the row from the Result instance.</programlisting>
</para>
</section>
<section xml:id="mapreduce.htable.access">

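The TableMapper snippet above is cut off by the hunk boundary just after the map() signature. For context, here is a hedged, self-contained sketch of what such a mapper could look like; the output types match the initTableMapperJob call shown earlier, but the cell-counting logic is an illustrative choice, not taken from the source.
<programlisting>import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Illustrative mapper: emits (row key, number of cells returned for that row).
public class MyMapper extends TableMapper<Text, LongWritable> {
  private final Text rowKey = new Text();
  private final LongWritable cellCount = new LongWritable();

  @Override
  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws InterruptedException, IOException {
    // process data for the row from the Result instance.
    rowKey.set(row.get());             // the row key bytes
    cellCount.set(value.raw().length); // how many KeyValues the Scan returned for this row
    context.write(rowKey, cellCount);
  }
}</programlisting>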
@@ -118,21 +115,24 @@
MapReduce job, other HBase tables can
be accessed as lookup tables, etc., in a
MapReduce job via creating an HTable instance in the setup method of the Mapper.
<programlisting>
public class MyMapper extends TableMapper<Text, LongWritable> {
<programlisting>public class MyMapper extends TableMapper<Text, LongWritable> {
private HTable myOtherTable;

@Override
public void setup(Context context) {
myOtherTable = new HTable("myOtherTable");
}
</programlisting>
}</programlisting>
</para>
</section>
</chapter>
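The setup() snippet above opens a second HTable, but the hunk ends before any teardown is shown. Assuming the usual Mapper lifecycle, the lookup table would presumably be released in cleanup(), along the lines of this sketch:
<programlisting>@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
  if (myOtherTable != null) {
    myOtherTable.close(); // flush and release the lookup table when the task finishes
  }
}</programlisting>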
|
<chapter xml:id="schema">
<title>HBase and Schema Design</title>
<para>A good general introduction on the strength and weaknesses modelling on
the various non-rdbms datastores is Ian Varleys' Master thesis,
<link xlink:href="http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf">No Relation: The Mixed Blessings of Non-Relational Databases</link>.
Recommended.
</para>
<section xml:id="schema.creation">
<title>
Schema Creation

@@ -142,10 +142,6 @@
</para>
</section>
<section xml:id="number.of.cfs">
<para>A good general introduction on the strength and weaknesses modelling on
the various non-rdbms datastores is Ian Varleys' Master thesis,
<link xlink:href="http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf">No Relation: The Mixed Blessings of Non-Relational Databases</link>.
</para>
<title>
On the number of column families
</title>

@@ -177,7 +173,7 @@
<para>If you do need to upload time series data into HBase, you should
study <link xlink:href="http://opentsdb.net/">OpenTSDB</link> as a
successful example. It has a page describing the <link xlink:href=" http://opentsdb.net/schema.html">schema</link> it uses in
HBase. The key format in OpenTSDB is effectively [metric_type][event_timestamp], which would appear at first glance to contradict the previous advice about not using a timestamp as the key. However, the difference is that the timestamp is not in the <b>lead</b> position of the key, and the design assumption is that there are dozens or hundreds (or more) of different metric types. Thus, even with a continual stream of input data with a mix of metric types, the Puts are distributed across various points of regions in the table.
HBase. The key format in OpenTSDB is effectively [metric_type][event_timestamp], which would appear at first glance to contradict the previous advice about not using a timestamp as the key. However, the difference is that the timestamp is not in the <emphasis>lead</emphasis> position of the key, and the design assumption is that there are dozens or hundreds (or more) of different metric types. Thus, even with a continual stream of input data with a mix of metric types, the Puts are distributed across various points of regions in the table.
</para>
</section>
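To make the [metric_type][event_timestamp] layout concrete, the following is a rough sketch of building such a row key with the HBase Bytes utility; the metric id, timestamp, column family and value are illustrative assumptions, not OpenTSDB's actual encoding.
<programlisting>import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// The metric id leads the key so that a mixed stream of metrics spreads Puts
// across regions; the timestamp follows, keeping one metric's samples contiguous.
byte[] metricId = Bytes.toBytes(42);        // assumed fixed-width metric identifier
byte[] ts = Bytes.toBytes(1300000000000L);  // event timestamp in milliseconds
byte[] rowKey = Bytes.add(metricId, ts);

Put put = new Put(rowKey);
put.add(Bytes.toBytes("t"), Bytes.toBytes("value"), Bytes.toBytes(98.6d));</programlisting>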
<section xml:id="keysize">

@@ -207,9 +203,8 @@
Tables in HBase are initially created with one region by default. For bulk imports, this means that all clients will write to the same region until it is large enough to split and become distributed across the cluster. A useful pattern to speed up the bulk import process is to pre-create empty regions. Be somewhat conservative in this, because too-many regions can actually degrade performance. An example of pre-creation using hex-keys is as follows (note: this example may need to be tweaked to the individual applications keys):
</para>
<para>
<programlisting>
public static boolean createTable(HBaseAdmin admin, HTableDescriptor table, byte[][] splits)
throws IOException {
<programlisting>public static boolean createTable(HBaseAdmin admin, HTableDescriptor table, byte[][] splits)
throws IOException {
try {
admin.createTable( table, splits );
return true;

@@ -218,13 +213,13 @@ Tables in HBase are initially created with one region by default. For bulk impo
// the table already exists...
return false;
}
}
public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
}

public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
byte[][] splits = new byte[numRegions-1][];
BigInteger lowestKey = new BigInteger(startKey, 16);
BigInteger highestKey = new BigInteger(endKey, 16);
BigInteger range = highestKey.subtract(lowestKey);

BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
lowestKey = lowestKey.add(regionIncrement);
for(int i=0; i < numRegions-1;i++) {

@@ -232,10 +227,8 @@ Tables in HBase are initially created with one region by default. For bulk impo
byte[] b = String.format("%016x", key).getBytes();
splits[i] = b;
}

return splits;
}
</programlisting>
}</programlisting>
</para>
</section>

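For reference, a sketch of how the two helpers above might be called to pre-create a table; the table name, column family, hex key range and region count are illustrative assumptions.
<programlisting>Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);

HTableDescriptor desc = new HTableDescriptor("myTable");
desc.addFamily(new HColumnDescriptor("cf"));

// Pre-split into ten regions across an 8-character hex key space.
byte[][] splits = getHexSplits("00000000", "ffffffff", 10);
if (!createTable(admin, desc, splits)) {
  // the table already existed, so nothing was created
}</programlisting>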
@@ -422,7 +415,7 @@ Tables in HBase are initially created with one region by default. For bulk impo
basically a synopsis of this article by Bruno Dumon.</para>
</footnote>.</para>

<section xml:id="versions">
<section xml:id="versions.ops">
<title>Versions and HBase Operations</title>

<para>In this section we look at the behavior of the version dimension

@@ -674,14 +667,17 @@ Tables in HBase are initially created with one region by default. For bulk impo
participate. The RegionServer splits a region, offlines the split
region and then adds the daughter regions to META, opens daughters on
the parent's hosting RegionServer and then reports the split to the
Master.</para>
Master. See <link linkend="disable.splitting">Managed Splitting</link> for how to manually manage
splits (and for why you might do this)</para>
</section>

<section>
<title>Region Load Balancer</title>

<para>
Periodically, and when there are not any regions in transition, a load balancer will run and move regions around to balance cluster load.
Periodically, and when there are not any regions in transition,
a load balancer will run and move regions around to balance cluster load.
The period at which it runs can be configured.
</para>
</section>

@@ -1098,6 +1094,19 @@ When I build, why do I always get <code>Unable to find resource 'VM_global_libra
</answer>
</qandaentry>
</qandadiv>
<qandadiv><title>How do I...?</title>
<qandaentry xml:id="secondary.indices">
<question><para>
Secondary Indexes in HBase?
</para></question>
<answer>
<para>
For a useful introduction to the issues involved maintaining a secondary Index in a store like HBase,
see the David Butler message in this thread, <link xlink:href="http://search-hadoop.com/m/nvbiBp2TDP/Stargate%252Bhbase&subj=Stargate+hbase">HBase, mail # user - Stargate+hbase</link>
</para>
</answer>
</qandaentry>
</qandadiv>
</qandaset>
</appendix>

@@ -9,8 +9,9 @@
xmlns:db="http://docbook.org/ns/docbook">
<title>Building HBase</title>
<section xml:id="mvn_repo">
<title>Adding an HBase release to Apache's Maven Repository</title>
<para>Follow the instructions at <link xlink:href="http://www.apache.org/dev/publishing-maven-artifacts.html">Publishing Maven Artifacts</link>.
The 'trick' to makiing it all work is answering the questions put to you by the mvn release plugin properly,
The 'trick' to making it all work is answering the questions put to you by the mvn release plugin properly,
making sure it is using the actual branch,
and finally, before doing the mvn release:perform,
VERY IMPORTANT, hand edit the release.properties file that was put under HBASE_HOME by the previous step, release:perform. You need to edit it to make it point at right locations in SVN.

@@ -19,7 +19,7 @@
<footnote>
<para>
Be careful editing XML. Make sure you close all elements.
Run your file through <command>xmmlint</command> or similar
Run your file through <command>xmllint</command> or similar
to ensure well-formedness of your document after an edit session.
</para>
</footnote>

@@ -53,7 +53,7 @@ to ensure well-formedness of your document after an edit session.
via a reading of the source code itself.
</para>
<para>
Changes here will require a cluster restart for HBase to notice the change.
Currently, changes here will require a cluster restart for HBase to notice the change.
</para>
<!--The file hbase-default.xml is generated as part of
the build of the hbase site. See the hbase pom.xml.

@@ -68,12 +68,12 @@ to ensure well-formedness of your document after an edit session.
<para>Set HBase environment variables in this file.
Examples include options to pass the JVM on start of
an HBase daemon such as heap size and garbarge collector configs.
You also set configurations for HBase configuration, log directories,
You can also set configurations for HBase configuration, log directories,
niceness, ssh options, where to locate process pid files,
etc., via settings in this file. Open the file at
etc. Open the file at
<filename>conf/hbase-env.sh</filename> and peruse its content.
Each option is fairly well documented. Add your own environment
variables here if you want them read by HBase daemon startup.</para>
variables here if you want them read by HBase daemons on startup.</para>
<para>
Changes here will require a cluster restart for HBase to notice the change.
</para>

@@ -92,7 +92,8 @@ to ensure well-formedness of your document after an edit session.

<section xml:id="important_configurations">
<title>The Important Configurations</title>
<para>Below we list the important Configurations. We've divided this section into
<para>Below we list what the <emphasis>important</emphasis>
Configurations. We've divided this section into
required configuration and worth-a-look recommended configs.
</para>

@@ -292,26 +293,25 @@ of all regions.
]]></programlisting>
</para>
<section>
<title>Java client configuration connecting to an HBase cluster</title>
<title>Java client configuration</title>
<subtitle>How Java reads <filename>hbase-site.xml</filename> content</subtitle>
<para>The configuration used by a client is kept
<para>The configuration used by a java client is kept
in an <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HBaseConfiguration">HBaseConfiguration</link> instance.
Through the factory method on HBaseConfiguration...
<programlisting>Configuration config = HBaseConfiguration.create();</programlisting>
.. a client will pick up a configuration sourced from the first <filename>hbase-site.xml</filename> found on
The factory method on HBaseConfiguration, <code>HBaseConfiguration.create();</code>,
on invocation, will read in the content of the first <filename>hbase-site.xml</filename> found on
the client's <varname>CLASSPATH</varname>, if one is present
(We'll also factor in any <filename>hbase-default.xml</filename> found; an hbase-default.xml ships inside
the <filename>hbase.X.X.X.jar</filename>).
(Invocation will also factor in any <filename>hbase-default.xml</filename> found;
an hbase-default.xml ships inside the <filename>hbase.X.X.X.jar</filename>).
It is also possible to specify configuration directly without having to read from a
<filename>hbase-site.xml</filename>. For examplt to set the
<link linkend="zookeeper">zookeeper</link> ensemble for the cluster programmatically do as follows...
<filename>hbase-site.xml</filename>. For example, to set the
<link linkend="zookeeper">zookeeper</link> ensemble for the cluster programmatically do as follows:
<programlisting>Configuration config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", "localhost"); // we are running zookeeper locally
</programlisting>
config.set("hbase.zookeeper.quorum", "localhost"); // Here we are running zookeeper locally</programlisting>
If multiple <link linkend="zookeeper">zookeeper</link> instances make up your zookeeper ensemble,
they may be specified in a comma-list (just as in the <filename>hbase-site.xml</filename> file).
This populated Configuration instance can then be passed to an
<link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html">HTable</link> instance via the overloaded constructor.
they may be specified in a comma-separated list (just as in the <filename>hbase-site.xml</filename> file).
This populated <classname>Configuration</classname> instance can then be passed to an
<link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html">HTable</link>,
and so on.
</para>
</section>
</section>

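Pulling the pieces of this section together, a minimal client sketch might read as follows; the quorum hosts and table name are placeholders.
<programlisting>Configuration config = HBaseConfiguration.create();
// Comma-separated ensemble, exactly as it would appear in hbase-site.xml.
config.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");

// Pass the populated Configuration to the overloaded HTable constructor.
HTable table = new HTable(config, "myTable");</programlisting>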
@@ -104,17 +104,15 @@
<title>AutoFlush</title>

<para>When performing a lot of Puts, make sure that setAutoFlush is set
to false on <link
to false on your <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html">HTable</link>
instance. Otherwise, the Puts will be sent one at a time to the
regionserver. Puts added via... <programlisting>
htable.add(Put);
</programlisting> ... and ... <programlisting>
htable.add( <List> Put);
</programlisting> ... wind up in the same write buffer. If autoFlush=false,
regionserver. Puts added via <code> htable.add(Put)</code> and <code> htable.add( <List> Put)</code>
wind up in the same write buffer. If <code>autoFlush = false</code>,
these messages are not sent until the write-buffer is filled. To
explicitly flush the messages, call .flushCommits(). Calling .close() on
the htable instance will invoke flushCommits().</para>
explicitly flush the messages, call <methodname>flushCommits</methodname>.
Calling <methodname>close</methodname> on the <classname>HTable</classname>
instance will invoke <methodname>flushCommits</methodname>.</para>
</section>
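A short sketch of the write pattern described above, assuming an existing Configuration named config and a List of Puts named puts; the buffer size and table name are illustrative.
<programlisting>HTable htable = new HTable(config, "myTable");
htable.setAutoFlush(false);                 // buffer Puts client-side instead of one RPC per Put
htable.setWriteBufferSize(2 * 1024 * 1024); // optional: client write buffer size, in bytes

for (Put put : puts) {
  htable.put(put);                          // accumulates in the write buffer
}

htable.flushCommits();                      // push any remaining buffered Puts to the regionservers
htable.close();                             // close() also invokes flushCommits()</programlisting>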
|
<section xml:id="perf.hbase.client.caching">

@@ -123,7 +121,7 @@ htable.add( <List> Put);
<para>If HBase is used as an input source for a MapReduce job, for
example, make sure that the input <link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</link>
instance to the MapReduce job has setCaching set to something greater
instance to the MapReduce job has <methodname>setCaching</methodname> set to something greater
than the default (which is 1). Using the default value means that the
map-task will make call back to the region-server for every record
processed. Setting this value to 500, for example, will transfer 500

@@ -150,8 +148,7 @@ try {
} finally {
rs.close(); // always close the ResultScanner!
}
htable.close();
</programlisting></para>
htable.close();</programlisting></para>
</section>

<section xml:id="perf.hbase.client.blockcache">

@@ -160,8 +157,8 @@ htable.close();
<para><link
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</link>
instances can be set to use the block cache in the region server via the
setCacheBlocks method. For input Scans to MapReduce jobs, this should be
false. For frequently access rows, it is advisable to use the block
<methodname>setCacheBlocks</methodname> method. For input Scans to MapReduce jobs, this should be
<varname>false</varname>. For frequently accessed rows, it is advisable to use the block
cache.</para>
</section>
</section>

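The two Scan settings covered in the preceding sections are typically applied together when feeding a MapReduce job; a brief sketch, reusing the caching value of 500 from the text:
<programlisting>Scan scan = new Scan();
scan.setCaching(500);       // fetch 500 rows per RPC instead of the default of 1
scan.setCacheBlocks(false); // a full-table scan would churn the regionserver block cache, so skip it</programlisting>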
@@ -11,7 +11,7 @@

<para>
The HBase Shell is <link xlink:href="http://jruby.org">(J)Ruby</link>'s
IRB with some HBase particular verbs added. Anything you can do in
IRB with some HBase particular commands added. Anything you can do in
IRB, you should be able to do in the HBase Shell.</para>
<para>To run the HBase shell,
do as follows: