More edits: Moved ZK to its own chapter, put the bloom filter stuff together in one place, made the distributed setup more focused

git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1389153 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
Michael Stack 2012-09-23 22:01:16 +00:00
parent 7d709c965a
commit 623a9be04d
5 changed files with 704 additions and 673 deletions

View File

@ -2319,65 +2319,6 @@ myHtd.setValue(HTableDescriptor.SPLIT_POLICY, MyCustomSplitPolicy.class.getName(
</section> <!-- store -->
<section xml:id="blooms">
<title>Bloom Filters</title>
<para><link xlink:href="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filters</link> were developed over in <link
xlink:href="https://issues.apache.org/jira/browse/HBASE-1200">HBase-1200
Add bloomfilters</link>.<footnote>
<para>For a description of the development process -- why static blooms
rather than dynamic -- and for an overview of the unique properties
that pertain to blooms in HBase, as well as possible future
directions, see the <emphasis>Development Process</emphasis> section
of the document <link
xlink:href="https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf">BloomFilters
in HBase</link> attached to <link
xlink:href="https://issues.apache.org/jira/browse/HBASE-1200">HBase-1200</link>.</para>
</footnote><footnote>
<para>The bloom filters described here are actually version two of
blooms in HBase. In versions up to 0.19.x, HBase had a dynamic bloom
option based on work done by the <link
xlink:href="http://www.one-lab.org">European Commission One-Lab
Project 034819</link>. The core of the HBase bloom work was later
pulled up into Hadoop to implement org.apache.hadoop.io.BloomMapFile.
Version 1 of HBase blooms never worked that well. Version 2 is a
rewrite from scratch though again it starts with the one-lab
work.</para>
</footnote></para>
<para>See also <xref linkend="schema.bloom" /> and <xref linkend="config.bloom" />.
</para>
<section xml:id="bloom_footprint">
<title>Bloom StoreFile footprint</title>
<para>Bloom filters add an entry to the <classname>StoreFile</classname>
general <classname>FileInfo</classname> data structure and then two
extra entries to the <classname>StoreFile</classname> metadata
section.</para>
<section>
<title>BloomFilter in the <classname>StoreFile</classname>
<classname>FileInfo</classname> data structure</title>
<para><classname>FileInfo</classname> has a
<varname>BLOOM_FILTER_TYPE</varname> entry which is set to
<varname>NONE</varname>, <varname>ROW</varname> or
<varname>ROWCOL</varname>.</para>
</section>
<section>
<title>BloomFilter entries in <classname>StoreFile</classname>
metadata</title>
<para><varname>BLOOM_FILTER_META</varname> holds the Bloom Size, Hash
Function used, etc. It is small in size and is cached on
<classname>StoreFile.Reader</classname> load.</para>
<para><varname>BLOOM_FILTER_DATA</varname> is the actual bloom filter
data. It is obtained on demand and stored in the LRU cache, if the
cache is enabled (it is enabled by default).</para>
</section>
</section>
</section> <!-- bloom -->
</section> <!-- regions -->
<section xml:id="arch.bulk.load"><title>Bulk Loading</title>
@ -2519,6 +2460,7 @@ myHtd.setValue(HTableDescriptor.SPLIT_POLICY, MyCustomSplitPolicy.class.getName(
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="case_studies.xml" />
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ops_mgt.xml" />
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="developer.xml" />
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="zookeeper.xml" />
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="community.xml" />
<appendix xml:id="faq">

View File

@ -27,8 +27,10 @@
*/
-->
<title>Configuration</title>
<para>This chapter is the Not-So-Quick start guide to HBase configuration.</para>
<para>Please read this chapter carefully and ensure that all requirements have
<para>This chapter is the Not-So-Quick start guide to HBase configuration. It goes
over system requirements, Hadoop setup, the different HBase run modes, and the
various configurations in HBase. Please read this chapter carefully and ensure
that all <xref linkend="basic.requirements" /> have
been satisfied. Failure to do so will cause you (and us) grief debugging strange errors
and/or data loss.</para>
@ -56,6 +58,10 @@ to ensure well-formedness of your document after an edit session.
all nodes of the cluster. HBase will not do this for you.
Use <command>rsync</command>.</para>
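<para>For example, a minimal sketch of pushing the local configuration to each
host listed in <filename>conf/regionservers</filename> (the host list file and
paths here are illustrative; adjust them to match your deploy):</para>
<programlisting>for host in `cat conf/regionservers`; do
  rsync -az $HBASE_HOME/conf/ $host:$HBASE_HOME/conf/
done</programlisting>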
<section xml:id="basic.requirements">
<title>Basic Requirements</title>
<para>This section lists required services and some required system configuration.
</para>
<section xml:id="java">
<title>Java</title>
@ -237,7 +243,6 @@ to ensure well-formedness of your document after an edit session.
Currently only Hadoop versions 0.20.205.x and later -- this includes
hadoop 1.0.0 -- have a working, durable sync
<footnote>
<title>On Hadoop Versions</title>
<para>The Cloudera blog post <link xlink:href="http://www.cloudera.com/blog/2012/01/an-update-on-apache-hadoop-1-0/">An update on Apache Hadoop 1.0</link>
by Charles Zedlewski has a nice exposition on how all the Hadoop versions relate.
It's worth checking out if you are having trouble making sense of the
@ -352,6 +357,7 @@ to ensure well-formedness of your document after an edit session.
</section>
</section> <!-- hadoop -->
</section>
<section xml:id="standalone_dist">
<title>HBase run modes: Standalone and Distributed</title>
@ -686,565 +692,6 @@ stopping hbase...............</programlisting> Shutdown can take a moment to
</section>
</section> <!-- run modes -->
<section xml:id="zookeeper">
<title>ZooKeeper<indexterm>
<primary>ZooKeeper</primary>
</indexterm></title>
<para>A distributed HBase depends on a running ZooKeeper cluster.
All participating nodes and clients need to be able to access the
running ZooKeeper ensemble. HBase by default manages a ZooKeeper
"cluster" for you. It will start and stop the ZooKeeper ensemble
as part of the HBase start/stop process. You can also manage the
ZooKeeper ensemble independent of HBase and just point HBase at
the cluster it should use. To toggle HBase management of
ZooKeeper, use the <varname>HBASE_MANAGES_ZK</varname> variable in
<filename>conf/hbase-env.sh</filename>. This variable, which
defaults to <varname>true</varname>, tells HBase whether to
start/stop the ZooKeeper ensemble servers as part of HBase
start/stop.</para>
<para>When HBase manages the ZooKeeper ensemble, you can specify
ZooKeeper configuration using its native
<filename>zoo.cfg</filename> file, or, the easier option is to
just specify ZooKeeper options directly in
<filename>conf/hbase-site.xml</filename>. A ZooKeeper
configuration option can be set as a property in the HBase
<filename>hbase-site.xml</filename> XML configuration file by
prefacing the ZooKeeper option name with
<varname>hbase.zookeeper.property</varname>. For example, the
<varname>clientPort</varname> setting in ZooKeeper can be changed
by setting the
<varname>hbase.zookeeper.property.clientPort</varname> property.
For all default values used by HBase, including ZooKeeper
configuration, see <xref linkend="hbase_default_configurations" />. Look for the
<varname>hbase.zookeeper.property</varname> prefix <footnote>
<para>For the full list of ZooKeeper configurations, see
ZooKeeper's <filename>zoo.cfg</filename>. HBase does not ship
with a <filename>zoo.cfg</filename> so you will need to browse
the <filename>conf</filename> directory in an appropriate
ZooKeeper download.</para>
</footnote></para>
<para>You must at least list the ensemble servers in
<filename>hbase-site.xml</filename> using the
<varname>hbase.zookeeper.quorum</varname> property. This property
defaults to a single ensemble member at
<varname>localhost</varname> which is not suitable for a fully
distributed HBase. (It binds to the local machine only and remote
clients will not be able to connect). <note xml:id="how_many_zks">
<title>How many ZooKeepers should I run?</title>
<para>You can run a ZooKeeper ensemble that comprises 1 node
only but in production it is recommended that you run a
ZooKeeper ensemble of 3, 5 or 7 machines; the more members an
ensemble has, the more tolerant the ensemble is of host
failures. Also, run an odd number of machines. In ZooKeeper,
an even number of peers is supported, but it is normally not used
because an even sized ensemble requires, proportionally, more peers
to form a quorum than an odd sized ensemble requires. For example, an
ensemble with 4 peers requires 3 to form a quorum, while an ensemble with
5 also requires 3 to form a quorum. Thus, an ensemble of 5 allows 2 peers to
fail, and thus is more fault tolerant than the ensemble of 4, which allows
only 1 down peer.
</para>
<para>Give each ZooKeeper server around 1GB of RAM, and if possible, its own
dedicated disk (A dedicated disk is the best thing you can do
to ensure a performant ZooKeeper ensemble). For very heavily
loaded clusters, run ZooKeeper servers on separate machines
from RegionServers (DataNodes and TaskTrackers).</para>
</note></para>
<para>For example, to have HBase manage a ZooKeeper quorum on
nodes <emphasis>rs{1,2,3,4,5}.example.com</emphasis>, bound to
port 2222 (the default is 2181) ensure
<varname>HBASE_MANAGES_ZK</varname> is commented out or set to
<varname>true</varname> in <filename>conf/hbase-env.sh</filename>
and then edit <filename>conf/hbase-site.xml</filename> and set
<varname>hbase.zookeeper.property.clientPort</varname> and
<varname>hbase.zookeeper.quorum</varname>. You should also set
<varname>hbase.zookeeper.property.dataDir</varname> to other than
the default as the default has ZooKeeper persist data under
<filename>/tmp</filename> which is often cleared on system
restart. In the example below we have ZooKeeper persist to
<filename>/usr/local/zookeeper</filename>. <programlisting>
&lt;configuration&gt;
...
&lt;property&gt;
&lt;name&gt;hbase.zookeeper.property.clientPort&lt;/name&gt;
&lt;value&gt;2222&lt;/value&gt;
&lt;description&gt;Property from ZooKeeper's config zoo.cfg.
The port at which the clients will connect.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;hbase.zookeeper.quorum&lt;/name&gt;
&lt;value&gt;rs1.example.com,rs2.example.com,rs3.example.com,rs4.example.com,rs5.example.com&lt;/value&gt;
&lt;description&gt;Comma separated list of servers in the ZooKeeper Quorum.
For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
By default this is set to localhost for local and pseudo-distributed modes
of operation. For a fully-distributed setup, this should be set to a full
list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
this is the list of servers which we will start/stop ZooKeeper on.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;hbase.zookeeper.property.dataDir&lt;/name&gt;
&lt;value&gt;/usr/local/zookeeper&lt;/value&gt;
&lt;description&gt;Property from ZooKeeper's config zoo.cfg.
The directory where the snapshot is stored.
&lt;/description&gt;
&lt;/property&gt;
...
&lt;/configuration&gt;</programlisting></para>
<section>
<title>Using existing ZooKeeper ensemble</title>
<para>To point HBase at an existing ZooKeeper cluster, one that
is not managed by HBase, set <varname>HBASE_MANAGES_ZK</varname>
in <filename>conf/hbase-env.sh</filename> to false
<programlisting>
...
# Tell HBase whether it should manage its own instance of Zookeeper or not.
export HBASE_MANAGES_ZK=false</programlisting> Next set ensemble locations
and client port, if non-standard, in
<filename>hbase-site.xml</filename>, or add a suitably
configured <filename>zoo.cfg</filename> to HBase's
<filename>CLASSPATH</filename>. HBase will prefer the
configuration found in <filename>zoo.cfg</filename> over any
settings in <filename>hbase-site.xml</filename>.</para>
<para>When HBase manages ZooKeeper, it will start/stop the
ZooKeeper servers as a part of the regular start/stop scripts.
If you would like to run ZooKeeper yourself, independent of
HBase start/stop, you would do the following</para>
<programlisting>
${HBASE_HOME}/bin/hbase-daemons.sh {start,stop} zookeeper
</programlisting>
<para>Note that you can use HBase in this manner to spin up a
ZooKeeper cluster, unrelated to HBase. Just make sure to set
<varname>HBASE_MANAGES_ZK</varname> to <varname>false</varname>
if you want it to stay up across HBase restarts so that when
HBase shuts down, it doesn't take ZooKeeper down with it.</para>
<para>For more information about running a distinct ZooKeeper
cluster, see the ZooKeeper <link
xlink:href="http://hadoop.apache.org/zookeeper/docs/current/zookeeperStarted.html">Getting
Started Guide</link>. Additionally, see the <link xlink:href="http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A7">ZooKeeper Wiki</link> or the
<link xlink:href="http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_zkMulitServerSetup">ZooKeeper documentation</link>
for more information on ZooKeeper sizing.
</para>
</section>
<section xml:id="zk.sasl.auth">
<title>SASL Authentication with ZooKeeper</title>
<para>Newer releases of HBase (&gt;= 0.92) support
connecting to a ZooKeeper Quorum that supports
SASL authentication (available in ZooKeeper
versions 3.4.0 and later).</para>
<para>This describes how to set up HBase to mutually
authenticate with a ZooKeeper Quorum. ZooKeeper/HBase
mutual authentication (<link
xlink:href="https://issues.apache.org/jira/browse/HBASE-2418">HBASE-2418</link>)
is required as part of a complete secure HBase configuration
(<link
xlink:href="https://issues.apache.org/jira/browse/HBASE-3025">HBASE-3025</link>).
For simplicity of explication, this section ignores
additional configuration required (Secure HDFS and Coprocessor
configuration). It's recommended to begin with an
HBase-managed Zookeeper configuration (as opposed to a
standalone Zookeeper quorum) for ease of learning.
</para>
<section><title>Operating System Prerequisites</title>
<para>
You need to have a working Kerberos KDC setup. For
each <code>$HOST</code> that will run a ZooKeeper
server, you should have a principal
<code>zookeeper/$HOST</code>. For each such host,
add a service key (using the <code>kadmin</code> or
<code>kadmin.local</code> tool's <code>ktadd</code>
command) for <code>zookeeper/$HOST</code> and copy
this file to <code>$HOST</code>, and make it
readable only to the user that will run zookeeper on
<code>$HOST</code>. Note the location of this file,
which we will use below as
<filename>$PATH_TO_ZOOKEEPER_KEYTAB</filename>.
</para>
<para>
Similarly, for each <code>$HOST</code> that will run
an HBase server (master or regionserver), you should
have a principal: <code>hbase/$HOST</code>. For each
host, add a keytab file called
<filename>hbase.keytab</filename> containing a service
key for <code>hbase/$HOST</code>, copy this file to
<code>$HOST</code>, and make it readable only to the
user that will run an HBase service on
<code>$HOST</code>. Note the location of this file,
which we will use below as
<filename>$PATH_TO_HBASE_KEYTAB</filename>.
</para>
<para>
Each user who will be an HBase client should also be
given a Kerberos principal. This principal should
usually have a password assigned to it (as opposed to,
as with the HBase servers, a keytab file) which only
this user knows. The client's principal's
<code>maxrenewlife</code> should be set so that it can
be renewed enough so that the user can complete their
HBase client processes. For example, if a user runs a
long-running HBase client process that takes at most 3
days, we might create this user's principal within
<code>kadmin</code> with: <code>addprinc -maxrenewlife
3days</code>. The Zookeeper client and server
libraries manage their own ticket refreshment by
running threads that wake up periodically to do the
refreshment.
</para>
<para>On each host that will run an HBase client
(e.g. <code>hbase shell</code>), add the following
file to the HBase home directory's <filename>conf</filename>
directory:</para>
<programlisting>
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=false
useTicketCache=true;
};
</programlisting>
<para>We'll refer to this JAAS configuration file as
<filename>$CLIENT_CONF</filename> below.</para>
</section>
<section>
<title>HBase-managed Zookeeper Configuration</title>
<para>On each node that will run a zookeeper, a
master, or a regionserver, create a <link
xlink:href="http://docs.oracle.com/javase/1.4.2/docs/guide/security/jgss/tutorials/LoginConfigFile.html">JAAS</link>
configuration file in the conf directory of the node's
<filename>HBASE_HOME</filename> directory that looks like the
following:</para>
<programlisting>
Server {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="$PATH_TO_ZOOKEEPER_KEYTAB"
storeKey=true
useTicketCache=false
principal="zookeeper/$HOST";
};
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
useTicketCache=false
keyTab="$PATH_TO_HBASE_KEYTAB"
principal="hbase/$HOST";
};
</programlisting>
where the <filename>$PATH_TO_HBASE_KEYTAB</filename> and
<filename>$PATH_TO_ZOOKEEPER_KEYTAB</filename> files are what
you created above, and <code>$HOST</code> is the hostname for that
node.
<para>The <code>Server</code> section will be used by
the Zookeeper quorum server, while the
<code>Client</code> section will be used by the HBase
master and regionservers. The path to this file should
be substituted for the text <filename>$HBASE_SERVER_CONF</filename>
in the <filename>hbase-env.sh</filename>
listing below.</para>
<para>
The path to the client JAAS configuration file created earlier should
be substituted for the text <filename>$CLIENT_CONF</filename> in the
<filename>hbase-env.sh</filename> listing below.
</para>
<para>Modify your <filename>hbase-env.sh</filename> to include the
following:</para>
<programlisting>
export HBASE_OPTS="-Djava.security.auth.login.config=$CLIENT_CONF"
export HBASE_MANAGES_ZK=true
export HBASE_ZOOKEEPER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"
export HBASE_MASTER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"
export HBASE_REGIONSERVER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"
</programlisting>
where <filename>$HBASE_SERVER_CONF</filename> and
<filename>$CLIENT_CONF</filename> are the full paths to the
JAAS configuration files created above.
<para>Modify your <filename>hbase-site.xml</filename> on each node
that will run zookeeper, master or regionserver to contain:</para>
<programlisting><![CDATA[
<configuration>
<property>
<name>hbase.zookeeper.quorum</name>
<value>$ZK_NODES</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.property.authProvider.1</name>
<value>org.apache.zookeeper.server.auth.SASLAuthenticationProvider</value>
</property>
<property>
<name>hbase.zookeeper.property.kerberos.removeHostFromPrincipal</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.property.kerberos.removeRealmFromPrincipal</name>
<value>true</value>
</property>
</configuration>
]]></programlisting>
<para>where <code>$ZK_NODES</code> is the
comma-separated list of hostnames of the Zookeeper
Quorum hosts.</para>
<para>Start your hbase cluster by running one or more
of the following set of commands on the appropriate
hosts:
</para>
<programlisting>
bin/hbase zookeeper start
bin/hbase master start
bin/hbase regionserver start
</programlisting>
</section>
<section><title>External Zookeeper Configuration</title>
<para>Add a JAAS configuration file that looks like:
<programlisting>
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
useTicketCache=false
keyTab="$PATH_TO_HBASE_KEYTAB"
principal="hbase/$HOST";
};
</programlisting>
where the <filename>$PATH_TO_HBASE_KEYTAB</filename> is the keytab
created above for HBase services to run on this host, and <code>$HOST</code> is the
hostname for that node. Put this in the HBase home's
configuration directory. We'll refer to this file's
full pathname as <filename>$HBASE_SERVER_CONF</filename> below.</para>
<para>Modify your hbase-env.sh to include the following:</para>
<programlisting>
export HBASE_OPTS="-Djava.security.auth.login.config=$CLIENT_CONF"
export HBASE_MANAGES_ZK=false
export HBASE_MASTER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"
export HBASE_REGIONSERVER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"
</programlisting>
<para>Modify your <filename>hbase-site.xml</filename> on each node
that will run a master or regionserver to contain:</para>
<programlisting><![CDATA[
<configuration>
<property>
<name>hbase.zookeeper.quorum</name>
<value>$ZK_NODES</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
</configuration>
]]>
</programlisting>
<para>where <code>$ZK_NODES</code> is the
comma-separated list of hostnames of the Zookeeper
Quorum hosts.</para>
<para>
Add a <filename>zoo.cfg</filename> for each Zookeeper Quorum host containing:
<programlisting>
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
kerberos.removeHostFromPrincipal=true
kerberos.removeRealmFromPrincipal=true
</programlisting>
Also on each of these hosts, create a JAAS configuration file containing:
<programlisting>
Server {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="$PATH_TO_ZOOKEEPER_KEYTAB"
storeKey=true
useTicketCache=false
principal="zookeeper/$HOST";
};
</programlisting>
where <code>$HOST</code> is the hostname of each
Quorum host. We will refer to the full pathname of
this file as <filename>$ZK_SERVER_CONF</filename> below.
</para>
<para>
Start your Zookeepers on each Zookeeper Quorum host with:
<programlisting>
SERVER_JVMFLAGS="-Djava.security.auth.login.config=$ZK_SERVER_CONF" bin/zkServer start
</programlisting>
</para>
<para>
Start your HBase cluster by running one or more of the following set of commands on the appropriate nodes:
</para>
<programlisting>
bin/hbase master start
bin/hbase regionserver start
</programlisting>
</section>
<section>
<title>Zookeeper Server Authentication Log Output</title>
<para>If the configuration above is successful,
you should see something similar to the following in
your Zookeeper server logs:
<programlisting>
11/12/05 22:43:39 INFO zookeeper.Login: successfully logged in.
11/12/05 22:43:39 INFO server.NIOServerCnxnFactory: binding to port 0.0.0.0/0.0.0.0:2181
11/12/05 22:43:39 INFO zookeeper.Login: TGT refresh thread started.
11/12/05 22:43:39 INFO zookeeper.Login: TGT valid starting at: Mon Dec 05 22:43:39 UTC 2011
11/12/05 22:43:39 INFO zookeeper.Login: TGT expires: Tue Dec 06 22:43:39 UTC 2011
11/12/05 22:43:39 INFO zookeeper.Login: TGT refresh sleeping until: Tue Dec 06 18:36:42 UTC 2011
..
11/12/05 22:43:59 INFO auth.SaslServerCallbackHandler:
Successfully authenticated client: authenticationID=hbase/ip-10-166-175-249.us-west-1.compute.internal@HADOOP.LOCALDOMAIN;
authorizationID=hbase/ip-10-166-175-249.us-west-1.compute.internal@HADOOP.LOCALDOMAIN.
11/12/05 22:43:59 INFO auth.SaslServerCallbackHandler: Setting authorizedID: hbase
11/12/05 22:43:59 INFO server.ZooKeeperServer: adding SASL authorization for authorizationID: hbase
</programlisting>
</para>
</section>
<section>
<title>Zookeeper Client Authentication Log Output</title>
<para>On the Zookeeper client side (HBase master or regionserver),
you should see something similar to the following:
<programlisting>
11/12/05 22:43:59 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=ip-10-166-175-249.us-west-1.compute.internal:2181 sessionTimeout=180000 watcher=master:60000
11/12/05 22:43:59 INFO zookeeper.ClientCnxn: Opening socket connection to server /10.166.175.249:2181
11/12/05 22:43:59 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 14851@ip-10-166-175-249
11/12/05 22:43:59 INFO zookeeper.Login: successfully logged in.
11/12/05 22:43:59 INFO client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism.
11/12/05 22:43:59 INFO zookeeper.Login: TGT refresh thread started.
11/12/05 22:43:59 INFO zookeeper.ClientCnxn: Socket connection established to ip-10-166-175-249.us-west-1.compute.internal/10.166.175.249:2181, initiating session
11/12/05 22:43:59 INFO zookeeper.Login: TGT valid starting at: Mon Dec 05 22:43:59 UTC 2011
11/12/05 22:43:59 INFO zookeeper.Login: TGT expires: Tue Dec 06 22:43:59 UTC 2011
11/12/05 22:43:59 INFO zookeeper.Login: TGT refresh sleeping until: Tue Dec 06 18:30:37 UTC 2011
11/12/05 22:43:59 INFO zookeeper.ClientCnxn: Session establishment complete on server ip-10-166-175-249.us-west-1.compute.internal/10.166.175.249:2181, sessionid = 0x134106594320000, negotiated timeout = 180000
</programlisting>
</para>
</section>
<section>
<title>Configuration from Scratch</title>
This has been tested on the current standard Amazon
Linux AMI. First set up the KDC and principals as
described above. Next check out the code and run a sanity
check.
<programlisting>
git clone git://git.apache.org/hbase.git
cd hbase
mvn -PlocalTests clean test -Dtest=TestZooKeeperACL
</programlisting>
Then configure HBase as described above.
Manually edit target/cached_classpath.txt (see below).
<programlisting>
bin/hbase zookeeper &amp;
bin/hbase master &amp;
bin/hbase regionserver &amp;
</programlisting>
</section>
<section>
<title>Future improvements</title>
<section><title>Fix target/cached_classpath.txt</title>
<para>
You must override the standard hadoop-core jar file from the
<code>target/cached_classpath.txt</code>
file with the version containing the HADOOP-7070 fix. You can use the following script to do this:
<programlisting>
echo `find ~/.m2 -name "*hadoop-core*7070*SNAPSHOT.jar"` ':' `cat target/cached_classpath.txt` | sed 's/ //g' > target/tmp.txt
mv target/tmp.txt target/cached_classpath.txt
</programlisting>
</para>
</section>
<section>
<title>Set JAAS configuration
programmatically</title>
This would avoid the need for a separate Hadoop jar
that fixes <link xlink:href="https://issues.apache.org/jira/browse/HADOOP-7070">HADOOP-7070</link>.
</section>
<section>
<title>Elimination of
<code>kerberos.removeHostFromPrincipal</code> and
<code>kerberos.removeRealmFromPrincipal</code></title>
</section>
</section>
</section> <!-- SASL Authentication with ZooKeeper -->
</section> <!-- zookeeper -->
<section xml:id="config.files">
@ -1704,34 +1151,4 @@ of all regions.
</section> <!-- important config -->
<section xml:id="config.bloom">
<title>Bloom Filter Configuration</title>
<section>
<title><varname>io.hfile.bloom.enabled</varname> global kill
switch</title>
<para><code>io.hfile.bloom.enabled</code> in
<classname>Configuration</classname> serves as the kill switch in case
something goes wrong. Default = <varname>true</varname>.</para>
</section>
<section>
<title><varname>io.hfile.bloom.error.rate</varname></title>
<para><varname>io.hfile.bloom.error.rate</varname> = average false
positive rate. Default = 1%. Halving the error rate (e.g. to .5%)
costs roughly one additional bit per bloom entry.</para>
</section>
<section>
<title><varname>io.hfile.bloom.max.fold</varname></title>
<para><varname>io.hfile.bloom.max.fold</varname> = guaranteed minimum
fold rate. Most people should leave this alone. Default = 7, which means
the bloom can fold down to at least 1/128th of its original size. See the
<emphasis>Development Process</emphasis> section of the document <link
xlink:href="https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf">BloomFilters
in HBase</link> for more on what this option means.</para>
</section>
</section>
</chapter>

View File

@ -33,8 +33,9 @@
<para><xref linkend="quickstart" /> will get you up and
running on a single-node instance of HBase using the local filesystem.
<xref linkend="configuration" /> describes setup
of HBase in distributed mode running on top of HDFS.</para>
<xref linkend="configuration" /> describes basic system
requirements and configuration running HBase in distributed mode
on top of HDFS.</para>
</section>
<section xml:id="quickstart">
@ -51,7 +52,7 @@
<para>Choose a download site from this list of <link
xlink:href="http://www.apache.org/dyn/closer.cgi/hbase/">Apache Download
Mirrors</link>. Click on suggested top link. This will take you to a
Mirrors</link>. Click on the suggested top link. This will take you to a
mirror of <emphasis>HBase Releases</emphasis>. Click on the folder named
<filename>stable</filename> and then download the file that ends in
<filename>.tar.gz</filename> to your local filesystem; e.g.
@ -65,24 +66,21 @@ $ cd hbase-<?eval ${project.version}?>
</programlisting></para>
<para>At this point, you are ready to start HBase. But before starting
it, you might want to edit <filename>conf/hbase-site.xml</filename> and
set the directory you want HBase to write to,
<varname>hbase.rootdir</varname>. <programlisting>
&lt;?xml version="1.0"?&gt;
it, you might want to edit <filename>conf/hbase-site.xml</filename>, the
file into which you write your site-specific configuration, and
set <varname>hbase.rootdir</varname>, the directory where HBase writes its data:
<programlisting>&lt;?xml version="1.0"?&gt;
&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;
&lt;configuration&gt;
&lt;property&gt;
&lt;name&gt;hbase.rootdir&lt;/name&gt;
&lt;value&gt;file:///DIRECTORY/hbase&lt;/value&gt;
&lt;/property&gt;
&lt;/configuration&gt;
</programlisting> Replace <varname>DIRECTORY</varname> in the above with a
path to a directory where you want HBase to store its data. By default,
&lt;/configuration&gt;</programlisting> Replace <varname>DIRECTORY</varname> in the above with the
path to the directory where you want HBase to store its data. By default,
<varname>hbase.rootdir</varname> is set to
<filename>/tmp/hbase-${user.name}</filename> which means you'll lose all
your data whenever your server reboots (Most operating systems clear
your data whenever your server reboots unless you change it (Most operating systems clear
<filename>/tmp</filename> on restart).</para>
</section>
@ -96,7 +94,7 @@ starting Master, logging to logs/hbase-user-master-example.org.out</programlisti
standalone mode, HBase runs all daemons in a single JVM; i.e. both
the HBase and ZooKeeper daemons. HBase logs can be found in the
<filename>logs</filename> subdirectory. Check them out especially if
HBase had trouble starting.</para>
it seems HBase had trouble starting.</para>
<note>
<title>Is <application>java</application> installed?</title>
@ -108,7 +106,7 @@ starting Master, logging to logs/hbase-user-master-example.org.out</programlisti
options the java program takes (HBase requires java 6). If this is not
the case, HBase will not start. Install java, edit
<filename>conf/hbase-env.sh</filename>, uncommenting the
<envar>JAVA_HOME</envar> line pointing it to your java install. Then,
<envar>JAVA_HOME</envar> line pointing it to your java install, then
retry the steps above.</para>
</note>
</section>
@ -154,9 +152,7 @@ hbase(main):006:0&gt; put 'test', 'row3', 'cf:c', 'value3'
<varname>cf</varname> in this example -- followed by a colon and then a
column qualifier suffix (<varname>a</varname> in this case).</para>
<para>Verify the data insert.</para>
<para>Run a scan of the table by doing the following</para>
<para>Verify the data insert by running a scan of the table as follows</para>
<para><programlisting>hbase(main):007:0&gt; scan 'test'
ROW COLUMN+CELL
@ -165,7 +161,7 @@ row2 column=cf:b, timestamp=1288380738440, value=value2
row3 column=cf:c, timestamp=1288380747365, value=value3
3 row(s) in 0.0590 seconds</programlisting></para>
<para>Get a single row as follows</para>
<para>Get a single row</para>
<para><programlisting>hbase(main):008:0&gt; get 'test', 'row1'
COLUMN CELL
@ -198,9 +194,9 @@ stopping hbase...............</programlisting></para>
<title>Where to go next</title>
<para>The above described standalone setup is good for testing and
experiments only. Next move on to <xref linkend="configuration" /> where we'll go into
depth on the different HBase run modes, requirements and critical
configurations needed setting up a distributed HBase deploy.</para>
experiments only. In the next chapter, <xref linkend="configuration" />,
we'll go into depth on the different HBase run modes, the system requirements
for running HBase, and the critical configurations needed to set up a distributed HBase deploy.</para>
</section>
</section>

View File

@ -526,6 +526,96 @@ htable.close();</programlisting></para>
too few regions then the reads could likely be served from too few nodes. </para>
<para>See <xref linkend="precreate.regions"/>, as well as <xref linkend="perf.configurations"/> </para>
</section>
<section xml:id="blooms">
<title>Bloom Filters</title>
<para>Enabling Bloom Filters can save you from having to go to disk and
can help improve read latencies.</para>
<para><link xlink:href="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filters</link> were developed over in <link
xlink:href="https://issues.apache.org/jira/browse/HBASE-1200">HBase-1200
Add bloomfilters</link>.<footnote>
<para>For a description of the development process -- why static blooms
rather than dynamic -- and for an overview of the unique properties
that pertain to blooms in HBase, as well as possible future
directions, see the <emphasis>Development Process</emphasis> section
of the document <link
xlink:href="https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf">BloomFilters
in HBase</link> attached to <link
xlink:href="https://issues.apache.org/jira/browse/HBASE-1200">HBase-1200</link>.</para>
</footnote><footnote>
<para>The bloom filters described here are actually version two of
blooms in HBase. In versions up to 0.19.x, HBase had a dynamic bloom
option based on work done by the <link
xlink:href="http://www.one-lab.org">European Commission One-Lab
Project 034819</link>. The core of the HBase bloom work was later
pulled up into Hadoop to implement org.apache.hadoop.io.BloomMapFile.
Version 1 of HBase blooms never worked that well. Version 2 is a
rewrite from scratch though again it starts with the one-lab
work.</para>
</footnote></para>
<para>See also <xref linkend="schema.bloom" />.
</para>
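<para>Bloom filters are enabled per column family; see <xref linkend="schema.bloom" />
for the details. As an illustrative sketch only (the table and column family names
below are made up), a ROW bloom might be enabled from the HBase shell at table
creation time:</para>
<programlisting>hbase(main):001:0&gt; create 'mytable', {NAME =&gt; 'cf', BLOOMFILTER =&gt; 'ROW'}</programlisting>
<para>A <varname>ROWCOL</varname> bloom can help when reads specify both the row
and the column qualifier, at the cost of a larger bloom.</para>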
<section xml:id="bloom_footprint">
<title>Bloom StoreFile footprint</title>
<para>Bloom filters add an entry to the <classname>StoreFile</classname>
general <classname>FileInfo</classname> data structure and then two
extra entries to the <classname>StoreFile</classname> metadata
section.</para>
<section>
<title>BloomFilter in the <classname>StoreFile</classname>
<classname>FileInfo</classname> data structure</title>
<para><classname>FileInfo</classname> has a
<varname>BLOOM_FILTER_TYPE</varname> entry which is set to
<varname>NONE</varname>, <varname>ROW</varname> or
<varname>ROWCOL</varname>.</para>
</section>
<section>
<title>BloomFilter entries in <classname>StoreFile</classname>
metadata</title>
<para><varname>BLOOM_FILTER_META</varname> holds the Bloom Size, Hash
Function used, etc. It is small in size and is cached on
<classname>StoreFile.Reader</classname> load.</para>
<para><varname>BLOOM_FILTER_DATA</varname> is the actual bloom filter
data. It is obtained on demand and stored in the LRU cache, if the
cache is enabled (it is enabled by default).</para>
</section>
</section>
<section xml:id="config.bloom">
<title>Bloom Filter Configuration</title>
<section>
<title><varname>io.hfile.bloom.enabled</varname> global kill
switch</title>
<para><code>io.hfile.bloom.enabled</code> in
<classname>Configuration</classname> serves as the kill switch in case
something goes wrong. Default = <varname>true</varname>.</para>
</section>
<section>
<title><varname>io.hfile.bloom.error.rate</varname></title>
<para><varname>io.hfile.bloom.error.rate</varname> = average false
positive rate. Default = 1%. Halving the error rate (e.g. to .5%)
costs roughly one additional bit per bloom entry.</para>
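<para>For example, to halve the default false positive rate to .5%, you might add
something like the following to <filename>hbase-site.xml</filename> (the value is
expressed as a fraction, so the 1% default corresponds to 0.01):</para>
<programlisting>
&lt;property&gt;
  &lt;name&gt;io.hfile.bloom.error.rate&lt;/name&gt;
  &lt;value&gt;0.005&lt;/value&gt;
&lt;/property&gt;
</programlisting>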
</section>
<section>
<title><varname>io.hfile.bloom.max.fold</varname></title>
<para><varname>io.hfile.bloom.max.fold</varname> = guaranteed minimum
fold rate. Most people should leave this alone. Default = 7, which means
the bloom can fold down to at least 1/128th of its original size. See the
<emphasis>Development Process</emphasis> section of the document <link
xlink:href="https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf">BloomFilters
in HBase</link> for more on what this option means.</para>
</section>
</section>
</section> <!-- bloom -->
</section> <!-- reading -->

586
src/docbkx/zookeeper.xml Normal file
View File

@ -0,0 +1,586 @@
<?xml version="1.0"?>
<chapter xml:id="zookeeper"
version="5.0" xmlns="http://docbook.org/ns/docbook"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns:m="http://www.w3.org/1998/Math/MathML"
xmlns:html="http://www.w3.org/1999/xhtml"
xmlns:db="http://docbook.org/ns/docbook">
<!--
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
-->
<title>ZooKeeper<indexterm>
<primary>ZooKeeper</primary>
</indexterm></title>
<para>A distributed HBase depends on a running ZooKeeper cluster.
All participating nodes and clients need to be able to access the
running ZooKeeper ensemble. HBase by default manages a ZooKeeper
"cluster" for you. It will start and stop the ZooKeeper ensemble
as part of the HBase start/stop process. You can also manage the
ZooKeeper ensemble independent of HBase and just point HBase at
the cluster it should use. To toggle HBase management of
ZooKeeper, use the <varname>HBASE_MANAGES_ZK</varname> variable in
<filename>conf/hbase-env.sh</filename>. This variable, which
defaults to <varname>true</varname>, tells HBase whether to
start/stop the ZooKeeper ensemble servers as part of HBase
start/stop.</para>
<para>When HBase manages the ZooKeeper ensemble, you can specify
ZooKeeper configuration using its native
<filename>zoo.cfg</filename> file, or, the easier option is to
just specify ZooKeeper options directly in
<filename>conf/hbase-site.xml</filename>. A ZooKeeper
configuration option can be set as a property in the HBase
<filename>hbase-site.xml</filename> XML configuration file by
prefacing the ZooKeeper option name with
<varname>hbase.zookeeper.property</varname>. For example, the
<varname>clientPort</varname> setting in ZooKeeper can be changed
by setting the
<varname>hbase.zookeeper.property.clientPort</varname> property.
For all default values used by HBase, including ZooKeeper
configuration, see <xref linkend="hbase_default_configurations" />. Look for the
<varname>hbase.zookeeper.property</varname> prefix <footnote>
<para>For the full list of ZooKeeper configurations, see
ZooKeeper's <filename>zoo.cfg</filename>. HBase does not ship
with a <filename>zoo.cfg</filename> so you will need to browse
the <filename>conf</filename> directory in an appropriate
ZooKeeper download.</para>
</footnote></para>
<para>You must at least list the ensemble servers in
<filename>hbase-site.xml</filename> using the
<varname>hbase.zookeeper.quorum</varname> property. This property
defaults to a single ensemble member at
<varname>localhost</varname> which is not suitable for a fully
distributed HBase. (It binds to the local machine only and remote
clients will not be able to connect). <note xml:id="how_many_zks">
<title>How many ZooKeepers should I run?</title>
<para>You can run a ZooKeeper ensemble that comprises 1 node
only but in production it is recommended that you run a
ZooKeeper ensemble of 3, 5 or 7 machines; the more members an
ensemble has, the more tolerant the ensemble is of host
failures. Also, run an odd number of machines. In ZooKeeper,
an even number of peers is supported, but it is normally not used
because an even sized ensemble requires, proportionally, more peers
to form a quorum than an odd sized ensemble requires. For example, an
ensemble with 4 peers requires 3 to form a quorum, while an ensemble with
5 also requires 3 to form a quorum. Thus, an ensemble of 5 allows 2 peers to
fail, and thus is more fault tolerant than the ensemble of 4, which allows
only 1 down peer.
</para>
<para>Give each ZooKeeper server around 1GB of RAM, and if possible, its own
dedicated disk (A dedicated disk is the best thing you can do
to ensure a performant ZooKeeper ensemble). For very heavily
loaded clusters, run ZooKeeper servers on separate machines
from RegionServers (DataNodes and TaskTrackers).</para>
</note></para>
<para>For example, to have HBase manage a ZooKeeper quorum on
nodes <emphasis>rs{1,2,3,4,5}.example.com</emphasis>, bound to
port 2222 (the default is 2181) ensure
<varname>HBASE_MANAGES_ZK</varname> is commented out or set to
<varname>true</varname> in <filename>conf/hbase-env.sh</filename>
and then edit <filename>conf/hbase-site.xml</filename> and set
<varname>hbase.zookeeper.property.clientPort</varname> and
<varname>hbase.zookeeper.quorum</varname>. You should also set
<varname>hbase.zookeeper.property.dataDir</varname> to other than
the default as the default has ZooKeeper persist data under
<filename>/tmp</filename> which is often cleared on system
restart. In the example below we have ZooKeeper persist to
<filename>/usr/local/zookeeper</filename>. <programlisting>
&lt;configuration&gt;
...
&lt;property&gt;
&lt;name&gt;hbase.zookeeper.property.clientPort&lt;/name&gt;
&lt;value&gt;2222&lt;/value&gt;
&lt;description&gt;Property from ZooKeeper's config zoo.cfg.
The port at which the clients will connect.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;hbase.zookeeper.quorum&lt;/name&gt;
&lt;value&gt;rs1.example.com,rs2.example.com,rs3.example.com,rs4.example.com,rs5.example.com&lt;/value&gt;
&lt;description&gt;Comma separated list of servers in the ZooKeeper Quorum.
For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
By default this is set to localhost for local and pseudo-distributed modes
of operation. For a fully-distributed setup, this should be set to a full
list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
this is the list of servers which we will start/stop ZooKeeper on.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;hbase.zookeeper.property.dataDir&lt;/name&gt;
&lt;value&gt;/usr/local/zookeeper&lt;/value&gt;
&lt;description&gt;Property from ZooKeeper's config zoo.cfg.
The directory where the snapshot is stored.
&lt;/description&gt;
&lt;/property&gt;
...
&lt;/configuration&gt;</programlisting></para>
<section>
<title>Using existing ZooKeeper ensemble</title>
<para>To point HBase at an existing ZooKeeper cluster, one that
is not managed by HBase, set <varname>HBASE_MANAGES_ZK</varname>
in <filename>conf/hbase-env.sh</filename> to false
<programlisting>
...
# Tell HBase whether it should manage its own instance of Zookeeper or not.
export HBASE_MANAGES_ZK=false</programlisting> Next set ensemble locations
and client port, if non-standard, in
<filename>hbase-site.xml</filename>, or add a suitably
configured <filename>zoo.cfg</filename> to HBase's
<filename>CLASSPATH</filename>. HBase will prefer the
configuration found in <filename>zoo.cfg</filename> over any
settings in <filename>hbase-site.xml</filename>.</para>
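<para>For example, a minimal <filename>hbase-site.xml</filename> fragment pointing
HBase at an external three-node ensemble might look like the following (the
hostnames are illustrative):</para>
<programlisting>
&lt;property&gt;
  &lt;name&gt;hbase.zookeeper.quorum&lt;/name&gt;
  &lt;value&gt;zk1.example.com,zk2.example.com,zk3.example.com&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;hbase.zookeeper.property.clientPort&lt;/name&gt;
  &lt;value&gt;2181&lt;/value&gt;
&lt;/property&gt;
</programlisting>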
<para>When HBase manages ZooKeeper, it will start/stop the
ZooKeeper servers as a part of the regular start/stop scripts.
If you would like to run ZooKeeper yourself, independent of
HBase start/stop, you would do the following</para>
<programlisting>
${HBASE_HOME}/bin/hbase-daemons.sh {start,stop} zookeeper
</programlisting>
<para>Note that you can use HBase in this manner to spin up a
ZooKeeper cluster, unrelated to HBase. Just make sure to set
<varname>HBASE_MANAGES_ZK</varname> to <varname>false</varname>
if you want it to stay up across HBase restarts so that when
HBase shuts down, it doesn't take ZooKeeper down with it.</para>
<para>For more information about running a distinct ZooKeeper
cluster, see the ZooKeeper <link
xlink:href="http://hadoop.apache.org/zookeeper/docs/current/zookeeperStarted.html">Getting
Started Guide</link>. Additionally, see the <link xlink:href="http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A7">ZooKeeper Wiki</link> or the
<link xlink:href="http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_zkMulitServerSetup">ZooKeeper documentation</link>
for more information on ZooKeeper sizing.
</para>
</section>
<section xml:id="zk.sasl.auth">
<title>SASL Authentication with ZooKeeper</title>
<para>Newer releases of HBase (&gt;= 0.92) support
connecting to a ZooKeeper Quorum that supports
SASL authentication (available in ZooKeeper
versions 3.4.0 and later).</para>
<para>This describes how to set up HBase to mutually
authenticate with a ZooKeeper Quorum. ZooKeeper/HBase
mutual authentication (<link
xlink:href="https://issues.apache.org/jira/browse/HBASE-2418">HBASE-2418</link>)
is required as part of a complete secure HBase configuration
(<link
xlink:href="https://issues.apache.org/jira/browse/HBASE-3025">HBASE-3025</link>).
For simplicity of explication, this section ignores
additional configuration required (Secure HDFS and Coprocessor
configuration). It's recommended to begin with an
HBase-managed Zookeeper configuration (as opposed to a
standalone Zookeeper quorum) for ease of learning.
</para>
<section><title>Operating System Prerequisites</title>
<para>
You need to have a working Kerberos KDC setup. For
each <code>$HOST</code> that will run a ZooKeeper
server, you should have a principal
<code>zookeeper/$HOST</code>. For each such host,
add a service key (using the <code>kadmin</code> or
<code>kadmin.local</code> tool's <code>ktadd</code>
command) for <code>zookeeper/$HOST</code> and copy
this file to <code>$HOST</code>, and make it
readable only to the user that will run zookeeper on
<code>$HOST</code>. Note the location of this file,
which we will use below as
<filename>$PATH_TO_ZOOKEEPER_KEYTAB</filename>.
</para>
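<para>As an illustrative sketch only (the realm, hostname, and keytab path below
are made up), the principal and its keytab might be created with
<command>kadmin.local</command> like so:</para>
<programlisting>
kadmin.local: addprinc -randkey zookeeper/host1.example.com@EXAMPLE.COM
kadmin.local: ktadd -k /etc/zookeeper/zookeeper.keytab zookeeper/host1.example.com@EXAMPLE.COM
</programlisting>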
<para>
Similarly, for each <code>$HOST</code> that will run
an HBase server (master or regionserver), you should
have a principal: <code>hbase/$HOST</code>. For each
host, add a keytab file called
<filename>hbase.keytab</filename> containing a service
key for <code>hbase/$HOST</code>, copy this file to
<code>$HOST</code>, and make it readable only to the
user that will run an HBase service on
<code>$HOST</code>. Note the location of this file,
which we will use below as
<filename>$PATH_TO_HBASE_KEYTAB</filename>.
</para>
<para>
Each user who will be an HBase client should also be
given a Kerberos principal. This principal should
usually have a password assigned to it (as opposed to,
as with the HBase servers, a keytab file) which only
this user knows. The client's principal's
<code>maxrenewlife</code> should be set so that it can
be renewed enough so that the user can complete their
HBase client processes. For example, if a user runs a
long-running HBase client process that takes at most 3
days, we might create this user's principal within
<code>kadmin</code> with: <code>addprinc -maxrenewlife
3days</code>. The Zookeeper client and server
libraries manage their own ticket refreshment by
running threads that wake up periodically to do the
refreshment.
</para>
<para>On each host that will run an HBase client
(e.g. <code>hbase shell</code>), add the following
file to the HBase home directory's <filename>conf</filename>
directory:</para>
<programlisting>
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=false
useTicketCache=true;
};
</programlisting>
<para>We'll refer to this JAAS configuration file as
<filename>$CLIENT_CONF</filename> below.</para>
</section>
<section>
<title>HBase-managed Zookeeper Configuration</title>
<para>On each node that will run a zookeeper, a
master, or a regionserver, create a <link
xlink:href="http://docs.oracle.com/javase/1.4.2/docs/guide/security/jgss/tutorials/LoginConfigFile.html">JAAS</link>
configuration file in the conf directory of the node's
<filename>HBASE_HOME</filename> directory that looks like the
following:</para>
<programlisting>
Server {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="$PATH_TO_ZOOKEEPER_KEYTAB"
storeKey=true
useTicketCache=false
principal="zookeeper/$HOST";
};
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
useTicketCache=false
keyTab="$PATH_TO_HBASE_KEYTAB"
principal="hbase/$HOST";
};
</programlisting>
where the <filename>$PATH_TO_HBASE_KEYTAB</filename> and
<filename>$PATH_TO_ZOOKEEPER_KEYTAB</filename> files are what
you created above, and <code>$HOST</code> is the hostname for that
node.
<para>The <code>Server</code> section will be used by
the Zookeeper quorum server, while the
<code>Client</code> section will be used by the HBase
master and regionservers. The path to this file should
be substituted for the text <filename>$HBASE_SERVER_CONF</filename>
in the <filename>hbase-env.sh</filename>
listing below.</para>
<para>
The path to the client JAAS configuration file created earlier should
be substituted for the text <filename>$CLIENT_CONF</filename> in the
<filename>hbase-env.sh</filename> listing below.
</para>
<para>Modify your <filename>hbase-env.sh</filename> to include the
following:</para>
<programlisting>
export HBASE_OPTS="-Djava.security.auth.login.config=$CLIENT_CONF"
export HBASE_MANAGES_ZK=true
export HBASE_ZOOKEEPER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"
export HBASE_MASTER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"
export HBASE_REGIONSERVER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"
</programlisting>
where <filename>$HBASE_SERVER_CONF</filename> and
<filename>$CLIENT_CONF</filename> are the full paths to the
JAAS configuration files created above.
<para>Modify your <filename>hbase-site.xml</filename> on each node
that will run zookeeper, master or regionserver to contain:</para>
<programlisting><![CDATA[
<configuration>
<property>
<name>hbase.zookeeper.quorum</name>
<value>$ZK_NODES</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.property.authProvider.1</name>
<value>org.apache.zookeeper.server.auth.SASLAuthenticationProvider</value>
</property>
<property>
<name>hbase.zookeeper.property.kerberos.removeHostFromPrincipal</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.property.kerberos.removeRealmFromPrincipal</name>
<value>true</value>
</property>
</configuration>
]]></programlisting>
<para>where <code>$ZK_NODES</code> is the
comma-separated list of hostnames of the Zookeeper
Quorum hosts.</para>
<para>Start your hbase cluster by running one or more
of the following set of commands on the appropriate
hosts:
</para>
<programlisting>
bin/hbase zookeeper start
bin/hbase master start
bin/hbase regionserver start
</programlisting>
</section>
<section><title>External Zookeeper Configuration</title>
<para>Add a JAAS configuration file that looks like:
<programlisting>
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
useTicketCache=false
keyTab="$PATH_TO_HBASE_KEYTAB"
principal="hbase/$HOST";
};
</programlisting>
where the <filename>$PATH_TO_HBASE_KEYTAB</filename> is the keytab
created above for HBase services to run on this host, and <code>$HOST</code> is the
hostname for that node. Put this in the HBase home's
configuration directory. We'll refer to this file's
full pathname as <filename>$HBASE_SERVER_CONF</filename> below.</para>
<para>Modify your hbase-env.sh to include the following:</para>
<programlisting>
export HBASE_OPTS="-Djava.security.auth.login.config=$CLIENT_CONF"
export HBASE_MANAGES_ZK=false
export HBASE_MASTER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"
export HBASE_REGIONSERVER_OPTS="-Djava.security.auth.login.config=$HBASE_SERVER_CONF"
</programlisting>
<para>Modify your <filename>hbase-site.xml</filename> on each node
that will run a master or regionserver to contain:</para>
<programlisting><![CDATA[
<configuration>
<property>
<name>hbase.zookeeper.quorum</name>
<value>$ZK_NODES</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
</configuration>
]]>
</programlisting>
<para>where <code>$ZK_NODES</code> is the
comma-separated list of hostnames of the Zookeeper
Quorum hosts.</para>
<para>
Add a <filename>zoo.cfg</filename> for each Zookeeper Quorum host containing:
<programlisting>
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
kerberos.removeHostFromPrincipal=true
kerberos.removeRealmFromPrincipal=true
</programlisting>
Also on each of these hosts, create a JAAS configuration file containing:
<programlisting>
Server {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="$PATH_TO_ZOOKEEPER_KEYTAB"
storeKey=true
useTicketCache=false
principal="zookeeper/$HOST";
};
</programlisting>
where <code>$HOST</code> is the hostname of each
Quorum host. We will refer to the full pathname of
this file as <filename>$ZK_SERVER_CONF</filename> below.
</para>
<para>
Start your Zookeepers on each Zookeeper Quorum host with:
<programlisting>
SERVER_JVMFLAGS="-Djava.security.auth.login.config=$ZK_SERVER_CONF" bin/zkServer start
</programlisting>
</para>
<para>
Start your HBase cluster by running one or more of the following set of commands on the appropriate nodes:
</para>
<programlisting>
bin/hbase master start
bin/hbase regionserver start
</programlisting>
</section>
<section>
<title>Zookeeper Server Authentication Log Output</title>
<para>If the configuration above is successful,
you should see something similar to the following in
your Zookeeper server logs:
<programlisting>
11/12/05 22:43:39 INFO zookeeper.Login: successfully logged in.
11/12/05 22:43:39 INFO server.NIOServerCnxnFactory: binding to port 0.0.0.0/0.0.0.0:2181
11/12/05 22:43:39 INFO zookeeper.Login: TGT refresh thread started.
11/12/05 22:43:39 INFO zookeeper.Login: TGT valid starting at: Mon Dec 05 22:43:39 UTC 2011
11/12/05 22:43:39 INFO zookeeper.Login: TGT expires: Tue Dec 06 22:43:39 UTC 2011
11/12/05 22:43:39 INFO zookeeper.Login: TGT refresh sleeping until: Tue Dec 06 18:36:42 UTC 2011
..
11/12/05 22:43:59 INFO auth.SaslServerCallbackHandler:
Successfully authenticated client: authenticationID=hbase/ip-10-166-175-249.us-west-1.compute.internal@HADOOP.LOCALDOMAIN;
authorizationID=hbase/ip-10-166-175-249.us-west-1.compute.internal@HADOOP.LOCALDOMAIN.
11/12/05 22:43:59 INFO auth.SaslServerCallbackHandler: Setting authorizedID: hbase
11/12/05 22:43:59 INFO server.ZooKeeperServer: adding SASL authorization for authorizationID: hbase
</programlisting>
</para>
</section>
<section>
<title>Zookeeper Client Authentication Log Output</title>
<para>On the Zookeeper client side (HBase master or regionserver),
you should see something similar to the following:
<programlisting>
11/12/05 22:43:59 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=ip-10-166-175-249.us-west-1.compute.internal:2181 sessionTimeout=180000 watcher=master:60000
11/12/05 22:43:59 INFO zookeeper.ClientCnxn: Opening socket connection to server /10.166.175.249:2181
11/12/05 22:43:59 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 14851@ip-10-166-175-249
11/12/05 22:43:59 INFO zookeeper.Login: successfully logged in.
11/12/05 22:43:59 INFO client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism.
11/12/05 22:43:59 INFO zookeeper.Login: TGT refresh thread started.
11/12/05 22:43:59 INFO zookeeper.ClientCnxn: Socket connection established to ip-10-166-175-249.us-west-1.compute.internal/10.166.175.249:2181, initiating session
11/12/05 22:43:59 INFO zookeeper.Login: TGT valid starting at: Mon Dec 05 22:43:59 UTC 2011
11/12/05 22:43:59 INFO zookeeper.Login: TGT expires: Tue Dec 06 22:43:59 UTC 2011
11/12/05 22:43:59 INFO zookeeper.Login: TGT refresh sleeping until: Tue Dec 06 18:30:37 UTC 2011
11/12/05 22:43:59 INFO zookeeper.ClientCnxn: Session establishment complete on server ip-10-166-175-249.us-west-1.compute.internal/10.166.175.249:2181, sessionid = 0x134106594320000, negotiated timeout = 180000
</programlisting>
</para>
</section>
<section>
<title>Configuration from Scratch</title>
This has been tested on the current standard Amazon
Linux AMI. First set up the KDC and principals as
described above. Next check out the code and run a sanity
check.
<programlisting>
git clone git://git.apache.org/hbase.git
cd hbase
mvn -PlocalTests clean test -Dtest=TestZooKeeperACL
</programlisting>
Then configure HBase as described above.
Manually edit target/cached_classpath.txt (see below).
<programlisting>
bin/hbase zookeeper &amp;
bin/hbase master &amp;
bin/hbase regionserver &amp;
</programlisting>
</section>
<section>
<title>Future improvements</title>
<section><title>Fix target/cached_classpath.txt</title>
<para>
You must override the standard hadoop-core jar file from the
<code>target/cached_classpath.txt</code>
file with the version containing the HADOOP-7070 fix. You can use the following script to do this:
<programlisting>
echo `find ~/.m2 -name "*hadoop-core*7070*SNAPSHOT.jar"` ':' `cat target/cached_classpath.txt` | sed 's/ //g' > target/tmp.txt
mv target/tmp.txt target/cached_classpath.txt
</programlisting>
</para>
</section>
<section>
<title>Set JAAS configuration
programmatically</title>
This would avoid the need for a separate Hadoop jar
that fixes <link xlink:href="https://issues.apache.org/jira/browse/HADOOP-7070">HADOOP-7070</link>.
</section>
<section>
<title>Elimination of
<code>kerberos.removeHostFromPrincipal</code> and
<code>kerberos.removeRealmFromPrincipal</code></title>
</section>
</section>
</section> <!-- SASL Authentication with ZooKeeper -->
</chapter>