HBASE-4006 HBase book - overhaul of configuration information
git-svn-id: https://svn.apache.org/repos/asf/hbase/trunk@1138311 13f79535-47bb-0310-9956-ffa450edef68
parent
73037604e0
commit
a5be25538f
|
@ -64,8 +64,8 @@
|
|||
<!--XInclude some chapters-->
|
||||
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="preface.xml" />
|
||||
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="getting_started.xml" />
|
||||
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="upgrading.xml" />
|
||||
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="configuration.xml" />
|
||||
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="upgrading.xml" />
|
||||
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="shell.xml" />
|
||||
|
||||
|
||||
|
|
|
@ -8,6 +8,11 @@
|
|||
xmlns:html="http://www.w3.org/1999/xhtml"
|
||||
xmlns:db="http://docbook.org/ns/docbook">
|
||||
<title>Configuration</title>
|
||||
<para>This chapter is the Not-So-Quick start guide to HBase configuration.</para>
|
||||
<para>Please read this chapter carefully and ensure that all requirements have
|
||||
been satisfied. Failure to do so will cause you (and us) grief debugging strange errors
|
||||
and/or data loss.</para>
|
||||
|
||||
<para>
|
||||
HBase uses the same configuration system as Hadoop.
|
||||
To configure a deploy, edit a file of environment variables
|
||||
|
@ -31,8 +36,654 @@ to ensure well-formedness of your document after an edit session.
|
|||
content of the <filename>conf</filename> directory to
|
||||
all nodes of the cluster. HBase will not do this for you.
|
||||
Use <command>rsync</command>.</para>
|
||||
|
||||
<section xml:id="java">
|
||||
<title>Java</title>
|
||||
|
||||
<para>Just like Hadoop, HBase requires Java 6 from <link
|
||||
xlink:href="http://www.java.com/download/">Oracle</link>. Usually
|
||||
you'll want to use the latest version available except the problematic
|
||||
u18 (u24 is the latest version as of this writing).</para>
|
||||
</section>
|
||||
<section xml:id="os">
|
||||
<title>Operating System</title>
|
||||
<section xml:id="ssh">
|
||||
<title>ssh</title>
|
||||
|
||||
<para><command>ssh</command> must be installed and
|
||||
<command>sshd</command> must be running to use Hadoop's scripts to
|
||||
manage remote Hadoop and HBase daemons. You must be able to ssh to all
|
||||
nodes, including your local node, using passwordless login (Google
|
||||
"ssh passwordless login").</para>
|
||||
</section>
|
||||
|
||||
<section xml:id="dns">
|
||||
<title>DNS</title>
|
||||
|
||||
<para>HBase uses the local hostname to self-report its IP address.
|
||||
Both forward and reverse DNS resolving should work.</para>
|
||||
|
||||
<para>If your machine has multiple interfaces, HBase will use the
|
||||
interface that the primary hostname resolves to.</para>
|
||||
|
||||
<para>If this is insufficient, you can set
|
||||
<varname>hbase.regionserver.dns.interface</varname> to indicate the
|
||||
primary interface. This only works if your cluster configuration is
|
||||
consistent and every host has the same network interface
|
||||
configuration.</para>
|
||||
|
||||
<para>Another alternative is setting
|
||||
<varname>hbase.regionserver.dns.nameserver</varname> to choose a
|
||||
different nameserver than the system wide default.</para>
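<para>As a minimal sketch of the two settings above, the following
<filename>hbase-site.xml</filename> fragment names an interface and a
nameserver. The values <varname>eth0</varname> and
<varname>ns1.example.org</varname> are placeholders chosen for illustration,
not defaults; substitute whatever your hosts actually use. You would normally
need only one of the two properties.</para>
<programlisting>
<configuration>
  ...
  <property>
    <name>hbase.regionserver.dns.interface</name>
    <!-- placeholder: the interface whose address HBase should report -->
    <value>eth0</value>
  </property>
  <property>
    <name>hbase.regionserver.dns.nameserver</name>
    <!-- placeholder: a nameserver other than the system-wide default -->
    <value>ns1.example.org</value>
  </property>
  ...
</configuration>
</programlisting>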
|
||||
</section>
|
||||
|
||||
<section xml:id="ntp">
|
||||
<title>NTP</title>
|
||||
|
||||
<para>The clocks on cluster members should be in basic alignment.
|
||||
Some skew is tolerable but wild skew could generate odd behaviors. Run
|
||||
<link
|
||||
xlink:href="http://en.wikipedia.org/wiki/Network_Time_Protocol">NTP</link>
|
||||
on your cluster, or an equivalent.</para>
|
||||
|
||||
<para>If you are having problems querying data, or "weird" cluster
|
||||
operations, check system time!</para>
|
||||
</section>
|
||||
|
||||
<section xml:id="ulimit">
|
||||
<title>
|
||||
<varname>ulimit</varname><indexterm>
|
||||
<primary>ulimit</primary>
|
||||
</indexterm>
|
||||
and
|
||||
<varname>nproc</varname><indexterm>
|
||||
<primary>nproc</primary>
|
||||
</indexterm>
|
||||
</title>
|
||||
|
||||
<para>HBase is a database. It uses a lot of files all at the same time.
|
||||
The default ulimit -n -- i.e. user file limit -- of 1024 on most *nix systems
|
||||
is insufficient (on Mac OS X it's 256). Any significant amount of loading will
|
||||
lead you to <link xlink:href="http://wiki.apache.org/hadoop/Hbase/FAQ#A6">FAQ: Why do I
|
||||
see "java.io.IOException...(Too many open files)" in my logs?</link>.
|
||||
You may also notice errors such as <programlisting>
|
||||
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception increateBlockOutputStream java.io.EOFException
|
||||
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901
|
||||
</programlisting> Do yourself a favor and change the upper bound on the
|
||||
number of file descriptors. Set it to north of 10k. See the above
|
||||
referenced FAQ for how. You should also up the hbase user's
|
||||
<varname>nproc</varname> setting; under load, a low-nproc
|
||||
setting could manifest as <classname>OutOfMemoryError</classname>
|
||||
<footnote><para>See Jack Levin's <link xlink:href="">major hdfs issues</link>
|
||||
note up on the user list.</para></footnote>
|
||||
<footnote><para>The requirement that a database requires upping of system limits
|
||||
is not peculiar to HBase. See for example the section
|
||||
<emphasis>Setting Shell Limits for the Oracle User</emphasis> in
|
||||
<link xlink:href="http://www.akadia.com/services/ora_linux_install_10g.html">
|
||||
Short Guide to install Oracle 10 on Linux</link>.</para></footnote>.
|
||||
</para>
|
||||
|
||||
<para>To be clear, upping the file descriptors and nproc for the user who is
|
||||
running the HBase process is an operating system configuration, not an
|
||||
HBase configuration. Also, a common mistake is that administrators
|
||||
will up the file descriptors for a particular user but for whatever
|
||||
reason, HBase will be running as someone else. HBase prints in its
|
||||
logs as the first line the ulimit it's seeing. Ensure it's correct.
|
||||
<footnote>
|
||||
<para>A useful read on setting config on your Hadoop cluster is Aaron
|
||||
Kimball's <link
|
||||
xlink:href="http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/">Configuration
|
||||
Parameters: What can you just ignore?</link></para>
|
||||
</footnote></para>
|
||||
|
||||
<section xml:id="ulimit_ubuntu">
|
||||
<title><varname>ulimit</varname> on Ubuntu</title>
|
||||
|
||||
<para>If you are on Ubuntu you will need to make the following
|
||||
changes:</para>
|
||||
|
||||
<para>In the file <filename>/etc/security/limits.conf</filename> add
|
||||
a line like: <programlisting>hadoop - nofile 32768</programlisting>
|
||||
Replace <varname>hadoop</varname> with whatever user is running
|
||||
Hadoop and HBase. If you have separate users, you will need 2
|
||||
entries, one for each user. In the same file set nproc hard and soft
|
||||
limits. For example: <programlisting>hadoop soft/hard nproc 32000</programlisting>.</para>
|
||||
|
||||
<para>In the file <filename>/etc/pam.d/common-session</filename> add
|
||||
as the last line in the file: <programlisting>session required pam_limits.so</programlisting>
|
||||
Otherwise the changes in <filename>/etc/security/limits.conf</filename> won't be
|
||||
applied.</para>
|
||||
|
||||
<para>Don't forget to log out and back in again for the changes to
|
||||
take effect!</para>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section xml:id="windows">
|
||||
<title>Windows</title>
|
||||
|
||||
<para>HBase has been little tested running on Windows. Running a
|
||||
production install of HBase on top of Windows is not
|
||||
recommended.</para>
|
||||
|
||||
<para>If you are running HBase on Windows, you must install <link
|
||||
xlink:href="http://cygwin.com/">Cygwin</link> to have a *nix-like
|
||||
environment for the shell scripts. The full details are explained in
|
||||
the <link xlink:href="http://hbase.apache.org/cygwin.html">Windows
|
||||
Installation</link> guide. Also
|
||||
<link xlink:href="http://search-hadoop.com/?q=hbase+windows&fc_project=HBase&fc_type=mail+_hash_+dev">search our user mailing list</link> to pick
|
||||
up the latest fixes figured out by Windows users.</para>
|
||||
</section>
|
||||
|
||||
</section> <!-- OS -->
|
||||
|
||||
<section xml:id="hadoop">
|
||||
<title><link
|
||||
xlink:href="http://hadoop.apache.org">Hadoop</link><indexterm>
|
||||
<primary>Hadoop</primary>
|
||||
</indexterm></title>
|
||||
|
||||
<para>
|
||||
This version of HBase will only run on <link
|
||||
xlink:href="http://hadoop.apache.org/common/releases.html">Hadoop
|
||||
0.20.x</link>. It will not run on Hadoop 0.21.x (nor 0.22.x).
|
||||
HBase will lose data unless it is running on an HDFS that has a durable
|
||||
<code>sync</code>. Hadoop 0.20.2 and Hadoop 0.20.203.0 DO NOT have this attribute.
|
||||
Currently only the <link
|
||||
xlink:href="http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/">branch-0.20-append</link>
|
||||
branch has this attribute<footnote>
|
||||
<para>See <link
|
||||
xlink:href="http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/CHANGES.txt">CHANGES.txt</link>
|
||||
in branch-0.20-append to see list of patches involved adding
|
||||
append on the Hadoop 0.20 branch.</para>
|
||||
</footnote>. No official releases have been made from the branch-0.20-append branch up
|
||||
to now so you will have to build your own Hadoop from the tip of this
|
||||
branch. Michael Noll has written a detailed blog,
|
||||
<link xlink:href="http://www.michael-noll.com/blog/2011/04/14/building-an-hadoop-0-20-x-version-for-hbase-0-90-2/">Building
|
||||
an Hadoop 0.20.x version for HBase 0.90.2</link>, on how to build an
|
||||
Hadoop from branch-0.20-append. Recommended.</para>
|
||||
|
||||
<para>Or rather than build your own, you could use Cloudera's <link
|
||||
xlink:href="http://archive.cloudera.com/docs/">CDH3</link>. CDH has
|
||||
the 0.20-append patches needed to add a durable sync (CDH3 betas will
|
||||
suffice; b2, b3, or b4).</para>
|
||||
|
||||
<para>Because HBase depends on Hadoop, it bundles an instance of the
|
||||
Hadoop jar under its <filename>lib</filename> directory. The bundled
|
||||
Hadoop was made from the Apache branch-0.20-append branch at the time
|
||||
of HBase's release. The bundled jar is ONLY for use in standalone mode.
|
||||
In distributed mode, it is <emphasis>critical</emphasis> that the version of Hadoop that is out
|
||||
on your cluster match what is under HBase. Replace the hadoop jar found in the HBase
|
||||
<filename>lib</filename> directory with the hadoop jar you are running on
|
||||
your cluster to avoid version mismatch issues. Make sure you
|
||||
replace the jar in HBase everywhere on your cluster. Hadoop version
|
||||
mismatch issues have various manifestations but often everything looks like
|
||||
it's hung up.</para>
|
||||
|
||||
<section xml:id="hadoop.security">
|
||||
<title>Hadoop Security</title>
|
||||
<para>HBase will run on any Hadoop 0.20.x that incorporates Hadoop
|
||||
security features -- e.g. Y! 0.20S or CDH3B3 -- as long as you do as
|
||||
suggested above and replace the Hadoop jar that ships with HBase
|
||||
with the secure version.</para>
|
||||
</section>
|
||||
|
||||
<section xml:id="dfs.datanode.max.xcievers">
|
||||
<title><varname>dfs.datanode.max.xcievers</varname><indexterm>
|
||||
<primary>xcievers</primary>
|
||||
</indexterm></title>
|
||||
|
||||
<para>An Hadoop HDFS datanode has an upper bound on the number of
|
||||
files that it will serve at any one time. The upper bound parameter is
|
||||
called <varname>xcievers</varname> (yes, this is misspelled). Again,
|
||||
before doing any loading, make sure you have configured Hadoop's
|
||||
<filename>conf/hdfs-site.xml</filename> setting the
|
||||
<varname>xcievers</varname> value to at least the following:
|
||||
<programlisting>
|
||||
<property>
|
||||
<name>dfs.datanode.max.xcievers</name>
|
||||
<value>4096</value>
|
||||
</property>
|
||||
</programlisting></para>
|
||||
|
||||
<para>Be sure to restart your HDFS after making the above
|
||||
configuration.</para>
|
||||
|
||||
<para>Not having this configuration in place makes for strange looking
|
||||
failures. Eventually you'll see a complaint in the datanode logs
|
||||
complaining about the xcievers limit being exceeded, but on the run up to this, one
|
||||
manifestation is a complaint about missing blocks. For example:
|
||||
<code>10/12/08 20:10:31 INFO hdfs.DFSClient: Could not obtain block
|
||||
blk_XXXXXXXXXXXXXXXXXXXXXX_YYYYYYYY from any node:
|
||||
java.io.IOException: No live nodes contain current block. Will get new
|
||||
block locations from namenode and retry...</code>
|
||||
<footnote><para>See <link xlink:href="http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html">Hadoop HDFS: Deceived by Xciever</link> for an informative rant on xceivering.</para></footnote></para>
|
||||
</section>
|
||||
|
||||
</section> <!-- hadoop -->
|
||||
|
||||
<section xml:id="standalone_dist">
|
||||
<title>HBase run modes: Standalone and Distributed</title>
|
||||
|
||||
<para>HBase has two run modes: <xref linkend="standalone" /> and <xref linkend="distributed" />. Out of the box, HBase runs in
|
||||
standalone mode. To set up a distributed deploy, you will need to
|
||||
configure HBase by editing files in the HBase <filename>conf</filename>
|
||||
directory.</para>
|
||||
|
||||
<para>Whatever your mode, you will need to edit
|
||||
<code>conf/hbase-env.sh</code> to tell HBase which
|
||||
<command>java</command> to use. In this file you set HBase environment
|
||||
variables such as the heapsize and other options for the
|
||||
<application>JVM</application>, the preferred location for log files,
|
||||
etc. Set <varname>JAVA_HOME</varname> to point at the root of your
|
||||
<command>java</command> install.</para>
|
||||
|
||||
<section xml:id="standalone">
|
||||
<title>Standalone HBase</title>
|
||||
|
||||
<para>This is the default mode. Standalone mode is what is described
|
||||
in the <xref linkend="quickstart" /> section. In
|
||||
standalone mode, HBase does not use HDFS -- it uses the local
|
||||
filesystem instead -- and it runs all HBase daemons and a local
|
||||
ZooKeeper all in the same JVM. ZooKeeper binds to a well-known port
|
||||
so clients may talk to HBase.</para>
|
||||
</section>
|
||||
|
||||
<section xml:id="distributed">
|
||||
<title>Distributed</title>
|
||||
|
||||
<para>Distributed mode can be subdivided into distributed but all
|
||||
daemons run on a single node -- a.k.a
|
||||
<emphasis>pseudo-distributed</emphasis>-- and
|
||||
<emphasis>fully-distributed</emphasis> where the daemons are spread
|
||||
across all nodes in the cluster <footnote>
|
||||
<para>The pseudo-distributed vs fully-distributed nomenclature
|
||||
comes from Hadoop.</para>
|
||||
</footnote>.</para>
|
||||
|
||||
<para>Distributed modes require an instance of the <emphasis>Hadoop
|
||||
Distributed File System</emphasis> (HDFS). See the Hadoop <link
|
||||
xlink:href="http://hadoop.apache.org/common/docs/current/api/overview-summary.html#overview_description">
|
||||
requirements and instructions</link> for how to set up a HDFS. Before
|
||||
proceeding, ensure you have an appropriate, working HDFS.</para>
|
||||
|
||||
<para>Below we describe the different distributed setups. Starting,
|
||||
verification and exploration of your install, whether a
|
||||
<emphasis>pseudo-distributed</emphasis> or
|
||||
<emphasis>fully-distributed</emphasis> configuration is described in a
|
||||
section that follows, <xref linkend="confirm" />. The same verification script applies to both
|
||||
deploy types.</para>
|
||||
|
||||
<section xml:id="pseudo">
|
||||
<title>Pseudo-distributed</title>
|
||||
|
||||
<para>A pseudo-distributed mode is simply a distributed mode run on
|
||||
a single host. Use this configuration for testing and prototyping on
|
||||
HBase. Do not use this configuration for production nor for
|
||||
evaluating HBase performance.</para>
|
||||
|
||||
<para>Once you have confirmed your HDFS setup, edit
|
||||
<filename>conf/hbase-site.xml</filename>. This is the file into
|
||||
which you add local customizations and overrides for
|
||||
<xref linkend="hbase_default_configurations" /> and <xref linkend="hdfs_client_conf" />. Point HBase at the running Hadoop HDFS
|
||||
instance by setting the <varname>hbase.rootdir</varname> property.
|
||||
This property points HBase at the Hadoop filesystem instance to use.
|
||||
For example, adding the properties below to your
|
||||
<filename>hbase-site.xml</filename> says that HBase should use the
|
||||
<filename>/hbase</filename> directory in the HDFS whose namenode is
|
||||
at port 9000 on your local machine, and that it should run with one
|
||||
replica only (recommended for pseudo-distributed mode):</para>
|
||||
|
||||
<programlisting>
|
||||
<configuration>
|
||||
...
|
||||
<property>
|
||||
<name>hbase.rootdir</name>
|
||||
<value>hdfs://localhost:9000/hbase</value>
|
||||
<description>The directory shared by RegionServers.
|
||||
</description>
|
||||
</property>
|
||||
<property>
|
||||
<name>dfs.replication</name>
|
||||
<value>1</value>
|
||||
<description>The replication count for HLog and HFile storage. Should not be greater than HDFS datanode count.
|
||||
</description>
|
||||
</property>
|
||||
...
|
||||
</configuration>
|
||||
</programlisting>
|
||||
|
||||
<note>
|
||||
<para>Let HBase create the <varname>hbase.rootdir</varname>
|
||||
directory. If you don't, you'll get a warning saying HBase needs a
|
||||
migration run because the directory is missing files expected by
|
||||
HBase (it'll create them if you let it).</para>
|
||||
</note>
|
||||
|
||||
<note>
|
||||
<para>Above we bind to <varname>localhost</varname>. This means
|
||||
that a remote client cannot connect. Amend accordingly, if you
|
||||
want to connect from a remote location.</para>
|
||||
</note>
|
||||
|
||||
<para>Now skip to <xref linkend="confirm" /> for how to start and verify your
|
||||
pseudo-distributed install. <footnote>
|
||||
<para>See <link
|
||||
xlink:href="http://hbase.apache.org/pseudo-distributed.html">Pseudo-distributed
|
||||
mode extras</link> for notes on how to start extra Masters and
|
||||
RegionServers when running pseudo-distributed.</para>
|
||||
</footnote></para>
|
||||
</section>
|
||||
|
||||
<section xml:id="fully_dist">
|
||||
<title>Fully-distributed</title>
|
||||
|
||||
<para>For running a fully-distributed operation on more than one
|
||||
host, make the following configurations. In
|
||||
<filename>hbase-site.xml</filename>, add the property
|
||||
<varname>hbase.cluster.distributed</varname> and set it to
|
||||
<varname>true</varname> and point the HBase
|
||||
<varname>hbase.rootdir</varname> at the appropriate HDFS NameNode
|
||||
and location in HDFS where you would like HBase to write data. For
|
||||
example, if your namenode were running at namenode.example.org on
|
||||
port 9000 and you wanted to home your HBase in HDFS at
|
||||
<filename>/hbase</filename>, make the following
|
||||
configuration.</para>
|
||||
|
||||
<programlisting>
|
||||
<configuration>
|
||||
...
|
||||
<property>
|
||||
<name>hbase.rootdir</name>
|
||||
<value>hdfs://namenode.example.org:9000/hbase</value>
|
||||
<description>The directory shared by RegionServers.
|
||||
</description>
|
||||
</property>
|
||||
<property>
|
||||
<name>hbase.cluster.distributed</name>
|
||||
<value>true</value>
|
||||
<description>The mode the cluster will be in. Possible values are
|
||||
false: standalone and pseudo-distributed setups with managed Zookeeper
|
||||
true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
|
||||
</description>
|
||||
</property>
|
||||
...
|
||||
</configuration>
|
||||
</programlisting>
|
||||
|
||||
<section xml:id="regionserver">
|
||||
<title><filename>regionservers</filename></title>
|
||||
|
||||
<para>In addition, a fully-distributed mode requires that you
|
||||
modify <filename>conf/regionservers</filename>. The
|
||||
<xref linkend="regionservers" /> file
|
||||
lists all hosts that you would have running
|
||||
<application>HRegionServer</application>s, one host per line (This
|
||||
file in HBase is like the Hadoop <filename>slaves</filename>
|
||||
file). All servers listed in this file will be started and stopped
|
||||
when HBase cluster start or stop is run.</para>
|
||||
</section>
|
||||
|
||||
<section xml:id="hbase.zookeeper">
|
||||
<title>ZooKeeper and HBase</title>
|
||||
<para>See section <xref linkend="zookeeper"/> for ZooKeeper setup for HBase.</para>
|
||||
</section>
|
||||
|
||||
<section xml:id="hdfs_client_conf">
|
||||
<title>HDFS Client Configuration</title>
|
||||
|
||||
<para>Of note, if you have made <emphasis>HDFS client
|
||||
configuration</emphasis> on your Hadoop cluster -- i.e.
|
||||
configuration you want HDFS clients to use as opposed to
|
||||
server-side configurations -- HBase will not see this
|
||||
configuration unless you do one of the following:</para>
|
||||
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
<para>Add a pointer to your <varname>HADOOP_CONF_DIR</varname>
|
||||
to the <varname>HBASE_CLASSPATH</varname> environment variable
|
||||
in <filename>hbase-env.sh</filename>.</para>
|
||||
</listitem>
|
||||
|
||||
<listitem>
|
||||
<para>Add a copy of <filename>hdfs-site.xml</filename> (or
|
||||
<filename>hadoop-site.xml</filename>) or, better, symlinks,
|
||||
under <filename>${HBASE_HOME}/conf</filename>, or</para>
|
||||
</listitem>
|
||||
|
||||
<listitem>
|
||||
<para>if only a small set of HDFS client configurations, add
|
||||
them to <filename>hbase-site.xml</filename>.</para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
|
||||
<para>An example of such an HDFS client configuration is
|
||||
<varname>dfs.replication</varname>. If for example, you want to
|
||||
run with a replication factor of 5, HBase will create files with
|
||||
the default of 3 unless you do the above to make the configuration
|
||||
available to HBase.</para>
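<para>As a sketch of the last option in the list above, and assuming
<varname>dfs.replication</varname> is the only HDFS client setting you need,
the following fragment could be added to
<filename>hbase-site.xml</filename>. The value 5 simply mirrors the example
in the preceding paragraph; it is not a recommendation.</para>
<programlisting>
<configuration>
  ...
  <property>
    <name>dfs.replication</name>
    <!-- illustrative value taken from the example above; should not exceed your datanode count -->
    <value>5</value>
  </property>
  ...
</configuration>
</programlisting>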
|
||||
</section>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section xml:id="confirm">
|
||||
<title>Running and Confirming Your Installation</title>
|
||||
|
||||
|
||||
|
||||
<para>Make sure HDFS is running first. Start and stop the Hadoop HDFS
|
||||
daemons by running <filename>bin/start-dfs.sh</filename> over in the
|
||||
<varname>HADOOP_HOME</varname> directory. You can ensure it started
|
||||
properly by testing the <command>put</command> and
|
||||
<command>get</command> of files into the Hadoop filesystem. HBase does
|
||||
not normally use the mapreduce daemons. These do not need to be
|
||||
started.</para>
|
||||
|
||||
|
||||
|
||||
<para><emphasis>If</emphasis> you are managing your own ZooKeeper,
|
||||
start it and confirm it's running; otherwise, HBase will start up ZooKeeper
|
||||
for you as part of its start process.</para>
|
||||
|
||||
|
||||
|
||||
<para>Start HBase with the following command:</para>
|
||||
|
||||
|
||||
|
||||
<programlisting>bin/start-hbase.sh</programlisting>
|
||||
|
||||
<para>Run the above from the <varname>HBASE_HOME</varname> directory.</para>
|
||||
|
||||
<para>You should now have a running HBase instance. HBase logs can be
|
||||
found in the <filename>logs</filename> subdirectory. Check them out
|
||||
especially if HBase had trouble starting.</para>
|
||||
|
||||
|
||||
|
||||
<para>HBase also puts up a UI listing vital attributes. By default it's
|
||||
deployed on the Master host at port 60010 (HBase RegionServers listen
|
||||
on port 60020 by default and put up an informational http server at
|
||||
60030). If the Master were running on a host named
|
||||
<varname>master.example.org</varname> on the default port, to see the
|
||||
Master's homepage you'd point your browser at
|
||||
<filename>http://master.example.org:60010</filename>.</para>
|
||||
|
||||
|
||||
|
||||
<para>Once HBase has started, see the <xref linkend="shell_exercises" /> for how to
|
||||
create tables, add data, scan your insertions, and finally disable and
|
||||
drop your tables.</para>
|
||||
|
||||
|
||||
|
||||
<para>To stop HBase after exiting the HBase shell enter
|
||||
<programlisting>$ ./bin/stop-hbase.sh
|
||||
stopping hbase...............</programlisting> Shutdown can take a moment to
|
||||
complete. It can take longer if your cluster is comprised of many
|
||||
machines. If you are running a distributed operation, be sure to wait
|
||||
until HBase has shut down completely before stopping the Hadoop
|
||||
daemons.</para>
|
||||
|
||||
|
||||
</section>
|
||||
</section> <!-- run modes -->
|
||||
|
||||
<section xml:id="zookeeper">
|
||||
<title>ZooKeeper<indexterm>
|
||||
<primary>ZooKeeper</primary>
|
||||
</indexterm></title>
|
||||
|
||||
<para>A distributed HBase depends on a running ZooKeeper cluster.
|
||||
All participating nodes and clients need to be able to access the
|
||||
running ZooKeeper ensemble. HBase by default manages a ZooKeeper
|
||||
"cluster" for you. It will start and stop the ZooKeeper ensemble
|
||||
as part of the HBase start/stop process. You can also manage the
|
||||
ZooKeeper ensemble independent of HBase and just point HBase at
|
||||
the cluster it should use. To toggle HBase management of
|
||||
ZooKeeper, use the <varname>HBASE_MANAGES_ZK</varname> variable in
|
||||
<filename>conf/hbase-env.sh</filename>. This variable, which
|
||||
defaults to <varname>true</varname>, tells HBase whether to
|
||||
start/stop the ZooKeeper ensemble servers as part of HBase
|
||||
start/stop.</para>
|
||||
|
||||
<para>When HBase manages the ZooKeeper ensemble, you can specify
|
||||
ZooKeeper configuration using its native
|
||||
<filename>zoo.cfg</filename> file, or, the easier option is to
|
||||
just specify ZooKeeper options directly in
|
||||
<filename>conf/hbase-site.xml</filename>. A ZooKeeper
|
||||
configuration option can be set as a property in the HBase
|
||||
<filename>hbase-site.xml</filename> XML configuration file by
|
||||
prefacing the ZooKeeper option name with
|
||||
<varname>hbase.zookeeper.property</varname>. For example, the
|
||||
<varname>clientPort</varname> setting in ZooKeeper can be changed
|
||||
by setting the
|
||||
<varname>hbase.zookeeper.property.clientPort</varname> property.
|
||||
For all default values used by HBase, including ZooKeeper
|
||||
configuration, see <xref linkend="hbase_default_configurations" />. Look for the
|
||||
<varname>hbase.zookeeper.property</varname> prefix <footnote>
|
||||
<para>For the full list of ZooKeeper configurations, see
|
||||
ZooKeeper's <filename>zoo.cfg</filename>. HBase does not ship
|
||||
with a <filename>zoo.cfg</filename> so you will need to browse
|
||||
the <filename>conf</filename> directory in an appropriate
|
||||
ZooKeeper download.</para>
|
||||
</footnote></para>
|
||||
|
||||
<para>You must at least list the ensemble servers in
|
||||
<filename>hbase-site.xml</filename> using the
|
||||
<varname>hbase.zookeeper.quorum</varname> property. This property
|
||||
defaults to a single ensemble member at
|
||||
<varname>localhost</varname> which is not suitable for a fully
|
||||
distributed HBase. (It binds to the local machine only and remote
|
||||
clients will not be able to connect). <note xml:id="how_many_zks">
|
||||
<title>How many ZooKeepers should I run?</title>
|
||||
|
||||
<para>You can run a ZooKeeper ensemble that comprises 1 node
|
||||
only but in production it is recommended that you run a
|
||||
ZooKeeper ensemble of 3, 5 or 7 machines; the more members an
|
||||
ensemble has, the more tolerant the ensemble is of host
|
||||
failures. Also, run an odd number of machines. An even-sized ensemble
tolerates no more host failures than the next smaller odd-sized one. Give each
|
||||
ZooKeeper server around 1GB of RAM, and if possible, its own
|
||||
dedicated disk (A dedicated disk is the best thing you can do
|
||||
to ensure a performant ZooKeeper ensemble). For very heavily
|
||||
loaded clusters, run ZooKeeper servers on separate machines
|
||||
from RegionServers (DataNodes and TaskTrackers).</para>
|
||||
</note></para>
|
||||
|
||||
<para>For example, to have HBase manage a ZooKeeper quorum on
|
||||
nodes <emphasis>rs{1,2,3,4,5}.example.com</emphasis>, bound to
|
||||
port 2222 (the default is 2181) ensure
|
||||
<varname>HBASE_MANAGES_ZK</varname> is commented out or set to
|
||||
<varname>true</varname> in <filename>conf/hbase-env.sh</filename>
|
||||
and then edit <filename>conf/hbase-site.xml</filename> and set
|
||||
<varname>hbase.zookeeper.property.clientPort</varname> and
|
||||
<varname>hbase.zookeeper.quorum</varname>. You should also set
|
||||
<varname>hbase.zookeeper.property.dataDir</varname> to other than
|
||||
the default as the default has ZooKeeper persist data under
|
||||
<filename>/tmp</filename> which is often cleared on system
|
||||
restart. In the example below we have ZooKeeper persist to
|
||||
<filename>/usr/local/zookeeper</filename>. <programlisting>
|
||||
<configuration>
|
||||
...
|
||||
<property>
|
||||
<name>hbase.zookeeper.property.clientPort</name>
|
||||
<value>2222</value>
|
||||
<description>Property from ZooKeeper's config zoo.cfg.
|
||||
The port at which the clients will connect.
|
||||
</description>
|
||||
</property>
|
||||
<property>
|
||||
<name>hbase.zookeeper.quorum</name>
|
||||
<value>rs1.example.com,rs2.example.com,rs3.example.com,rs4.example.com,rs5.example.com</value>
|
||||
<description>Comma separated list of servers in the ZooKeeper Quorum.
|
||||
For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
|
||||
By default this is set to localhost for local and pseudo-distributed modes
|
||||
of operation. For a fully-distributed setup, this should be set to a full
|
||||
list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
|
||||
this is the list of servers which we will start/stop ZooKeeper on.
|
||||
</description>
|
||||
</property>
|
||||
<property>
|
||||
<name>hbase.zookeeper.property.dataDir</name>
|
||||
<value>/usr/local/zookeeper</value>
|
||||
<description>Property from ZooKeeper's config zoo.cfg.
|
||||
The directory where the snapshot is stored.
|
||||
</description>
|
||||
</property>
|
||||
...
|
||||
</configuration></programlisting></para>
|
||||
|
||||
<section>
|
||||
<title>Using existing ZooKeeper ensemble</title>
|
||||
|
||||
<para>To point HBase at an existing ZooKeeper cluster, one that
|
||||
is not managed by HBase, set <varname>HBASE_MANAGES_ZK</varname>
|
||||
in <filename>conf/hbase-env.sh</filename> to false
|
||||
<programlisting>
|
||||
...
|
||||
# Tell HBase whether it should manage it's own instance of Zookeeper or not.
|
||||
export HBASE_MANAGES_ZK=false</programlisting> Next set ensemble locations
|
||||
and client port, if non-standard, in
|
||||
<filename>hbase-site.xml</filename>, or add a suitably
|
||||
configured <filename>zoo.cfg</filename> to HBase's
|
||||
<filename>CLASSPATH</filename>. HBase will prefer the
|
||||
configuration found in <filename>zoo.cfg</filename> over any
|
||||
settings in <filename>hbase-site.xml</filename>.</para>
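<para>A minimal sketch of the <filename>hbase-site.xml</filename> route,
assuming <varname>HBASE_MANAGES_ZK</varname> has been set to false as
described above. The hostnames <varname>zk1.example.com</varname> through
<varname>zk3.example.com</varname> are hypothetical; list the members of your
own ensemble. The <varname>clientPort</varname> property may be left out if
the ensemble listens on the default of 2181.</para>
<programlisting>
<configuration>
  ...
  <property>
    <name>hbase.zookeeper.quorum</name>
    <!-- hypothetical hosts of an ensemble that HBase does not manage -->
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <!-- only needed when the ensemble uses a non-standard port -->
    <value>2181</value>
  </property>
  ...
</configuration>
</programlisting>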
|
||||
|
||||
<para>When HBase manages ZooKeeper, it will start/stop the
|
||||
ZooKeeper servers as a part of the regular start/stop scripts.
|
||||
If you would like to run ZooKeeper yourself, independent of
|
||||
HBase start/stop, you would do the following</para>
|
||||
|
||||
<programlisting>
|
||||
${HBASE_HOME}/bin/hbase-daemons.sh {start,stop} zookeeper
|
||||
</programlisting>
|
||||
|
||||
<para>Note that you can use HBase in this manner to spin up a
|
||||
ZooKeeper cluster, unrelated to HBase. Just make sure to set
|
||||
<varname>HBASE_MANAGES_ZK</varname> to <varname>false</varname>
|
||||
if you want it to stay up across HBase restarts so that when
|
||||
HBase shuts down, it doesn't take ZooKeeper down with it.</para>
|
||||
|
||||
<para>For more information about running a distinct ZooKeeper
|
||||
cluster, see the ZooKeeper <link
|
||||
xlink:href="http://hadoop.apache.org/zookeeper/docs/current/zookeeperStarted.html">Getting
|
||||
Started Guide</link>. Additionally, see the <link xlink:href="http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A7">ZooKeeper Wiki</link> or the
|
||||
<link xlink:href="http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_zkMulitServerSetup">ZooKeeper documentation</link>
|
||||
for more information on ZooKeeper sizing.
|
||||
</para>
|
||||
</section>
|
||||
</section> <!-- zookeeper -->
|
||||
|
||||
|
||||
<section xml:id="config.files">
|
||||
<title>Configuration Files</title>
|
||||
|
||||
<section xml:id="hbase.site">
|
||||
<title><filename>hbase-site.xml</filename> and <filename>hbase-default.xml</filename></title>
|
||||
<para>Just as in Hadoop where you add site-specific HDFS configuration
|
||||
|
@ -90,6 +741,183 @@ to ensure well-formedness of your document after an edit session.
|
|||
</para>
|
||||
</section>
|
||||
|
||||
<section xml:id="client_dependencies"><title>Client configuration and dependencies connecting to an HBase cluster</title>
|
||||
<para>
|
||||
Since the HBase Master may move around, clients bootstrap by looking to ZooKeeper for
|
||||
current critical locations. ZooKeeper is where all these values are kept. Thus clients
|
||||
require the location of the ZooKeeper ensemble information before they can do anything else.
|
||||
Usually the ensemble location is kept in the <filename>hbase-site.xml</filename> and
|
||||
is picked up by the client from the <varname>CLASSPATH</varname>.</para>
|
||||
|
||||
<para>If you are configuring an IDE to run an HBase client, you should
|
||||
include the <filename>conf/</filename> directory on your classpath so
|
||||
<filename>hbase-site.xml</filename> settings can be found (or
|
||||
add <filename>src/test/resources</filename> to pick up the hbase-site.xml
|
||||
used by tests).
|
||||
</para>
|
||||
<para>
|
||||
Minimally, a client of HBase needs the hbase, hadoop, log4j, commons-logging, commons-lang,
|
||||
and ZooKeeper jars in its <varname>CLASSPATH</varname> when connecting to a cluster.
|
||||
</para>
|
||||
<para>
|
||||
An example basic <filename>hbase-site.xml</filename> for client only
|
||||
might look as follows:
|
||||
<programlisting><![CDATA[
|
||||
<?xml version="1.0"?>
|
||||
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
|
||||
<configuration>
|
||||
<property>
|
||||
<name>hbase.zookeeper.quorum</name>
|
||||
<value>example1,example2,example3</value>
|
||||
<description>Comma-separated list of servers in the ZooKeeper quorum.
|
||||
</description>
|
||||
</property>
|
||||
</configuration>
|
||||
]]></programlisting>
|
||||
</para>
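<para>To check which ensemble a client-side <filename>hbase-site.xml</filename> names, the file can be read with the JDK's own XML parser. The following is a minimal sketch only, and it is not how HBase itself loads configuration. The class <classname>SiteXmlSketch</classname> and its <code>lookup</code> method are hypothetical names introduced here for illustration.</para>

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of HBase): reads one property out of an
// hbase-site.xml style document using only the JDK XML parser.
class SiteXmlSketch {

    /** Returns the value of the named property, or null if it is absent. */
    static String lookup(String xml, String propertyName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList properties = doc.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                if (propertyName.equals(childText(property, "name"))) {
                    return childText(property, "value");
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    /** Trimmed text of the first descendant element with the given tag, or null. */
    private static String childText(Element parent, String tag) {
        NodeList matches = parent.getElementsByTagName(tag);
        return matches.getLength() == 0 ? null : matches.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
            + "<configuration><property>"
            + "<name>hbase.zookeeper.quorum</name>"
            + "<value>example1,example2,example3</value>"
            + "</property></configuration>";
        System.out.println(lookup(xml, "hbase.zookeeper.quorum"));
    }
}
```

<para>A real client would normally let <code>HBaseConfiguration.create()</code> do this work; the sketch is only meant to make the name/value structure of the file concrete.</para>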
|
||||
|
||||
<section xml:id="java.client.config">
|
||||
<title>Java client configuration</title>
|
||||
<para>The configuration used by a Java client is kept
|
||||
in an <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HBaseConfiguration">HBaseConfiguration</link> instance.
|
||||
The factory method on HBaseConfiguration, <code>HBaseConfiguration.create();</code>,
|
||||
on invocation, will read in the content of the first <filename>hbase-site.xml</filename> found on
|
||||
the client's <varname>CLASSPATH</varname>, if one is present
|
||||
(Invocation will also factor in any <filename>hbase-default.xml</filename> found;
|
||||
an hbase-default.xml ships inside the <filename>hbase.X.X.X.jar</filename>).
|
||||
It is also possible to specify configuration directly without having to read from a
|
||||
<filename>hbase-site.xml</filename>. For example, to set the ZooKeeper
|
||||
ensemble for the cluster programmatically do as follows:
|
||||
<programlisting>Configuration config = HBaseConfiguration.create();
|
||||
config.set("hbase.zookeeper.quorum", "localhost"); // Here we are running zookeeper locally</programlisting>
|
||||
If multiple ZooKeeper instances make up your ZooKeeper ensemble,
|
||||
they may be specified in a comma-separated list (just as in the <filename>hbase-site.xml</filename> file).
|
||||
This populated <classname>Configuration</classname> instance can then be passed to an
|
||||
<link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html">HTable</link>,
|
||||
and so on.
|
||||
</para>
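<para>When the ensemble hosts are held in a list, the comma-separated value can be assembled before it is handed to <code>config.set("hbase.zookeeper.quorum", ...)</code>. This is a small sketch under that assumption; <classname>QuorumSketch</classname> and <code>quorumValue</code> are hypothetical names, not HBase API.</para>

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of HBase): builds the comma-separated
// string expected by the hbase.zookeeper.quorum property.
class QuorumSketch {

    /** Joins host names with commas, trimming stray whitespace. */
    static String quorumValue(List<String> hosts) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("At least one ZooKeeper host is required");
        }
        StringBuilder joined = new StringBuilder();
        for (String host : hosts) {
            String trimmed = host.trim();
            if (trimmed.isEmpty()) {
                throw new IllegalArgumentException("Blank host name in quorum list");
            }
            if (joined.length() > 0) {
                joined.append(',');
            }
            joined.append(trimmed);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // The result would then be passed to config.set("hbase.zookeeper.quorum", value).
        System.out.println(quorumValue(Arrays.asList("example1", "example2", "example3")));
    }
}
```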
|
||||
</section>
|
||||
</section>
|
||||
|
||||
</section> <!-- config files -->
|
||||
|
||||
<section xml:id="example_config">
|
||||
<title>Example Configurations</title>
|
||||
|
||||
<section>
|
||||
<title>Basic Distributed HBase Install</title>
|
||||
|
||||
<para>Here is an example basic configuration for a distributed ten
|
||||
node cluster. The nodes are named <varname>example0</varname>,
|
||||
<varname>example1</varname>, etc., through node
|
||||
<varname>example9</varname> in this example. The HBase Master and the
|
||||
HDFS namenode are running on the node <varname>example0</varname>.
|
||||
RegionServers run on nodes
|
||||
<varname>example1</varname>-<varname>example9</varname>. A 3-node
|
||||
ZooKeeper ensemble runs on <varname>example1</varname>,
|
||||
<varname>example2</varname>, and <varname>example3</varname> on the
|
||||
default ports. ZooKeeper data is persisted to the directory
|
||||
<filename>/export/zookeeper</filename>. Below we show what the main
|
||||
configuration files -- <filename>hbase-site.xml</filename>,
|
||||
<filename>regionservers</filename>, and
|
||||
<filename>hbase-env.sh</filename> -- found in the HBase
|
||||
<filename>conf</filename> directory might look like.</para>
|
||||
|
||||
<section xml:id="hbase_site">
|
||||
<title><filename>hbase-site.xml</filename></title>
|
||||
|
||||
<programlisting>
|
||||
|
||||
<?xml version="1.0"?>
|
||||
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
|
||||
<configuration>
|
||||
<property>
|
||||
<name>hbase.zookeeper.quorum</name>
|
||||
<value>example1,example2,example3</value>
|
||||
<description>Comma-separated list of servers in the ZooKeeper quorum.
|
||||
</description>
|
||||
</property>
|
||||
<property>
|
||||
<name>hbase.zookeeper.property.dataDir</name>
|
||||
<value>/export/zookeeper</value>
|
||||
<description>Property from ZooKeeper's config zoo.cfg.
|
||||
The directory where the snapshot is stored.
|
||||
</description>
|
||||
</property>
|
||||
<property>
|
||||
<name>hbase.rootdir</name>
|
||||
<value>hdfs://example0:9000/hbase</value>
|
||||
<description>The directory shared by RegionServers.
|
||||
</description>
|
||||
</property>
|
||||
<property>
|
||||
<name>hbase.cluster.distributed</name>
|
||||
<value>true</value>
|
||||
<description>The mode the cluster will be in. Possible values are
|
||||
false: standalone and pseudo-distributed setups with managed Zookeeper
|
||||
true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
|
||||
</description>
|
||||
</property>
|
||||
</configuration>
|
||||
|
||||
</programlisting>
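<para>The <varname>hbase.rootdir</varname> value above packs three things into one URL: the namenode host, its port, and the directory HBase writes under. As an illustrative sketch only, the JDK's <classname>java.net.URI</classname> can take it apart; <classname>RootDirSketch</classname> and <code>describe</code> are hypothetical names.</para>

```java
import java.net.URI;

// Hypothetical helper (not part of HBase): splits an hbase.rootdir URL
// into the pieces discussed in the text.
class RootDirSketch {

    /** Summarises scheme, namenode host, port and HDFS path of a rootdir URL. */
    static String describe(String rootDir) {
        URI uri = URI.create(rootDir);
        return "scheme=" + uri.getScheme()
            + " namenode=" + uri.getHost()
            + " port=" + uri.getPort()
            + " path=" + uri.getPath();
    }

    public static void main(String[] args) {
        System.out.println(describe("hdfs://example0:9000/hbase"));
    }
}
```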
|
||||
</section>
|
||||
|
||||
<section xml:id="regionservers">
|
||||
<title><filename>regionservers</filename></title>
|
||||
|
||||
<para>In this file you list the nodes that will run RegionServers.
|
||||
In our case we run RegionServers on all but the head node
|
||||
<varname>example0</varname>, which carries the HBase Master and
the HDFS namenode.</para>
|
||||
|
||||
<programlisting>
|
||||
example1
example2
example3
|
||||
example4
|
||||
example5
|
||||
example6
|
||||
example7
|
||||
example8
|
||||
example9
|
||||
</programlisting>
|
||||
</section>
|
||||
|
||||
<section xml:id="hbase_env">
|
||||
<title><filename>hbase-env.sh</filename></title>
|
||||
|
||||
<para>Below we use a <command>diff</command> to show the differences
|
||||
from default in the <filename>hbase-env.sh</filename> file. Here we
|
||||
are setting the HBase heap to be 4G instead of the default
|
||||
1G.</para>
|
||||
|
||||
<programlisting>
|
||||
|
||||
$ git diff hbase-env.sh
|
||||
diff --git a/conf/hbase-env.sh b/conf/hbase-env.sh
|
||||
index e70ebc6..96f8c27 100644
|
||||
--- a/conf/hbase-env.sh
|
||||
+++ b/conf/hbase-env.sh
|
||||
@@ -31,7 +31,7 @@ export JAVA_HOME=/usr/lib//jvm/java-6-sun/
|
||||
# export HBASE_CLASSPATH=
|
||||
|
||||
# The maximum amount of heap to use, in MB. Default is 1000.
|
||||
-# export HBASE_HEAPSIZE=1000
|
||||
+export HBASE_HEAPSIZE=4096
|
||||
|
||||
# Extra Java runtime options.
|
||||
# Below are what we set by default. May only work with SUN JVM.
|
||||
|
||||
</programlisting>
|
||||
|
||||
<para>Use <command>rsync</command> to copy the content of the
|
||||
<filename>conf</filename> directory to all nodes of the
|
||||
cluster.</para>
|
||||
</section>
|
||||
</section>
|
||||
</section> <!-- example config -->
|
||||
|
||||
|
||||
<section xml:id="important_configurations">
|
||||
<title>The Important Configurations</title>
|
||||
<para>Below we list what the <emphasis>important</emphasis>
|
||||
|
@ -99,10 +927,7 @@ to ensure well-formedness of your document after an edit session.
|
|||
|
||||
|
||||
<section xml:id="required_configuration"><title>Required Configurations</title>
|
||||
<para>See <xref linkend="requirements" />.
|
||||
It lists at least two required configurations needed running HBase bearing
|
||||
load: i.e. <xref linkend="ulimit" /> and
|
||||
<xref linkend="dfs.datanode.max.xcievers" />.
|
||||
<para>Review the <xref linkend="os" /> and <xref linkend="hadoop" /> sections.
|
||||
</para>
|
||||
</section>
|
||||
|
||||
|
@ -130,10 +955,7 @@ to ensure well-formedness of your document after an edit session.
|
|||
</para>
|
||||
</section>
|
||||
<section xml:id="zookeeper.instances"><title>Number of ZooKeeper Instances</title>
|
||||
<para>It is best to use an odd number of machines (1, 3, 5) for a ZooKeeper ensemble.
|
||||
See the <link xlink:href="http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A7">ZooKeeper Wiki</link> or the
|
||||
<link xlink:href="http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_zkMulitServerSetup">ZooKeeper documentation</link>
|
||||
for more information on ZooKeeper sizing.
|
||||
<para>See <xref linkend="zookeeper"/>.
|
||||
</para>
|
||||
</section>
|
||||
<section xml:id="hbase.regionserver.handler.count"><title><varname>hbase.regionserver.handler.count</varname></title>
|
||||
|
@ -248,63 +1070,5 @@ of all regions.
|
|||
</section>
|
||||
|
||||
</section>
|
||||
<section xml:id="client_dependencies"><title>Client configuration and dependencies connecting to an HBase cluster</title>
|
||||
|
||||
<para>
|
||||
Since the HBase Master may move around, clients bootstrap by looking to ZooKeeper for
|
||||
current critical locations. ZooKeeper is where all these values are kept. Thus clients
|
||||
require the location of the ZooKeeper ensemble information before they can do anything else.
|
||||
Usually the ensemble location is kept in the <filename>hbase-site.xml</filename> and
|
||||
is picked up by the client from the <varname>CLASSPATH</varname>.</para>
|
||||
|
||||
<para>If you are configuring an IDE to run an HBase client, you should
|
||||
include the <filename>conf/</filename> directory on your classpath so
|
||||
<filename>hbase-site.xml</filename> settings can be found (or
|
||||
add <filename>src/test/resources</filename> to pick up the hbase-site.xml
|
||||
used by tests).
|
||||
</para>
|
||||
<para>
|
||||
Minimally, a client of HBase needs the hbase, hadoop, log4j, commons-logging, commons-lang,
|
||||
and ZooKeeper jars in its <varname>CLASSPATH</varname> when connecting to a cluster.
|
||||
</para>
|
||||
<para>
|
||||
An example basic <filename>hbase-site.xml</filename> for client only
|
||||
might look as follows:
|
||||
<programlisting><![CDATA[
|
||||
<?xml version="1.0"?>
|
||||
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
|
||||
<configuration>
|
||||
<property>
|
||||
<name>hbase.zookeeper.quorum</name>
|
||||
<value>example1,example2,example3</value>
|
||||
<description>Comma-separated list of servers in the ZooKeeper quorum.
|
||||
</description>
|
||||
</property>
|
||||
</configuration>
|
||||
]]></programlisting>
|
||||
</para>
|
||||
<section>
|
||||
<title>Java client configuration</title>
|
||||
<subtitle>How Java reads <filename>hbase-site.xml</filename> content</subtitle>
|
||||
<para>The configuration used by a java client is kept
|
||||
in an <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HBaseConfiguration">HBaseConfiguration</link> instance.
|
||||
The factory method on HBaseConfiguration, <code>HBaseConfiguration.create();</code>,
|
||||
on invocation, will read in the content of the first <filename>hbase-site.xml</filename> found on
|
||||
the client's <varname>CLASSPATH</varname>, if one is present
|
||||
(Invocation will also factor in any <filename>hbase-default.xml</filename> found;
|
||||
an hbase-default.xml ships inside the <filename>hbase.X.X.X.jar</filename>).
|
||||
It is also possible to specify configuration directly without having to read from a
|
||||
<filename>hbase-site.xml</filename>. For example, to set the ZooKeeper
|
||||
ensemble for the cluster programmatically do as follows:
|
||||
<programlisting>Configuration config = HBaseConfiguration.create();
|
||||
config.set("hbase.zookeeper.quorum", "localhost"); // Here we are running zookeeper locally</programlisting>
|
||||
If multiple ZooKeeper instances make up your ZooKeeper ensemble,
|
||||
they may be specified in a comma-separated list (just as in the <filename>hbase-site.xml</filename> file).
|
||||
This populated <classname>Configuration</classname> instance can then be passed to an
|
||||
<link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html">HTable</link>,
|
||||
and so on.
|
||||
</para>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
</chapter>
|
||||
|
|
|
@ -13,8 +13,8 @@
|
|||
<title>Introduction</title>
|
||||
|
||||
<para><xref linkend="quickstart" /> will get you up and
|
||||
running on a single-node instance of HBase using the local filesystem. The
|
||||
<xref linkend="notsoquick" /> describes setup
|
||||
running on a single-node instance of HBase using the local filesystem.
|
||||
<xref linkend="configuration" /> describes setup
|
||||
of HBase in distributed mode running on top of HDFS.</para>
|
||||
</section>
|
||||
|
||||
|
@ -179,770 +179,10 @@ stopping hbase...............</programlisting></para>
|
|||
<title>Where to go next</title>
|
||||
|
||||
<para>The above described standalone setup is good for testing and
|
||||
experiments only. Next move on to <xref linkend="notsoquick" /> where we'll go into
|
||||
experiments only. Next move on to <xref linkend="configuration" /> where we'll go into
|
||||
depth on the different HBase run modes, requirements and critical
|
||||
configurations needed setting up a distributed HBase deploy.</para>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section xml:id="notsoquick">
|
||||
<title>Not-so-quick Start Guide</title>
|
||||
|
||||
<section xml:id="requirements">
|
||||
<title>Requirements</title>
|
||||
|
||||
<para>HBase has the following requirements. Please read the section
|
||||
below carefully and ensure that all requirements have been satisfied.
|
||||
Failure to do so will cause you (and us) grief debugging strange errors
|
||||
and/or data loss.</para>
|
||||
|
||||
<section xml:id="java">
|
||||
<title>java</title>
|
||||
|
||||
<para>Just like Hadoop, HBase requires java 6 from <link
|
||||
xlink:href="http://www.java.com/download/">Oracle</link>. Usually
|
||||
you'll want to use the latest version available except the problematic
|
||||
u18 (u24 is the latest version as of this writing).</para>
|
||||
</section>
|
||||
|
||||
<section xml:id="hadoop">
|
||||
<title><link
|
||||
xlink:href="http://hadoop.apache.org">hadoop</link><indexterm>
|
||||
<primary>Hadoop</primary>
|
||||
</indexterm></title>
|
||||
|
||||
<para>
|
||||
This version of HBase will only run on <link
|
||||
xlink:href="http://hadoop.apache.org/common/releases.html">Hadoop
|
||||
0.20.x</link>. It will not run on hadoop 0.21.x (nor 0.22.x).
|
||||
HBase will lose data unless it is running on an HDFS that has a durable
|
||||
<code>sync</code>. Hadoop 0.20.2 and Hadoop 0.20.203.0 DO NOT have this attribute.
|
||||
Currently only the <link
|
||||
xlink:href="http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/">branch-0.20-append</link>
|
||||
branch has this attribute<footnote>
|
||||
<para>See <link
|
||||
xlink:href="http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/CHANGES.txt">CHANGES.txt</link>
|
||||
in branch-0.20-append to see list of patches involved adding
|
||||
append on the Hadoop 0.20 branch.</para>
|
||||
</footnote>. No official releases have been made from the branch-0.20-append branch up
|
||||
to now so you will have to build your own Hadoop from the tip of this
|
||||
branch. Michael Noll has written a detailed blog,
|
||||
<link xlink:href="http://www.michael-noll.com/blog/2011/04/14/building-an-hadoop-0-20-x-version-for-hbase-0-90-2/">Building
|
||||
an Hadoop 0.20.x version for HBase 0.90.2</link>, on how to build an
|
||||
Hadoop from branch-0.20-append. Recommended.</para>
|
||||
|
||||
<para>Or rather than build your own, you could use Cloudera's <link
|
||||
xlink:href="http://archive.cloudera.com/docs/">CDH3</link>. CDH has
|
||||
the 0.20-append patches needed to add a durable sync (CDH3 betas will
|
||||
suffice; b2, b3, or b4).</para>
|
||||
|
||||
<para>Because HBase depends on Hadoop, it bundles an instance of the
|
||||
Hadoop jar under its <filename>lib</filename> directory. The bundled
|
||||
Hadoop was made from the Apache branch-0.20-append branch at the time
|
||||
of HBase's release. The bundled jar is ONLY for use in standalone mode.
|
||||
In distributed mode, it is <emphasis>critical</emphasis> that the version of Hadoop that is out
|
||||
on your cluster match what is under HBase. Replace the hadoop jar found in the HBase
|
||||
<filename>lib</filename> directory with the hadoop jar you are running on
|
||||
your cluster to avoid version mismatch issues. Make sure you
|
||||
replace the jar in HBase everywhere on your cluster. Hadoop version
|
||||
mismatch issues have various manifestations, but often everything just looks like
it's hung up.</para>
|
||||
|
||||
<note>
|
||||
<title>Hadoop Security</title>
|
||||
|
||||
<para>HBase will run on any Hadoop 0.20.x that incorporates Hadoop
|
||||
security features -- e.g. Y! 0.20S or CDH3B3 -- as long as you do as
|
||||
suggested above and replace the Hadoop jar that ships with HBase
|
||||
with the secure version.</para>
|
||||
</note>
|
||||
</section>
|
||||
|
||||
<section xml:id="ssh">
|
||||
<title>ssh</title>
|
||||
|
||||
<para><command>ssh</command> must be installed and
|
||||
<command>sshd</command> must be running to use Hadoop's scripts to
|
||||
manage remote Hadoop and HBase daemons. You must be able to ssh to all
|
||||
nodes, including your local node, using passwordless login (Google
|
||||
"ssh passwordless login").</para>
|
||||
</section>
|
||||
|
||||
<section xml:id="dns">
|
||||
<title>DNS</title>
|
||||
|
||||
<para>HBase uses the local hostname to self-report its IP address.
|
||||
Both forward and reverse DNS resolving should work.</para>
|
||||
|
||||
<para>If your machine has multiple interfaces, HBase will use the
|
||||
interface that the primary hostname resolves to.</para>
|
||||
|
||||
<para>If this is insufficient, you can set
|
||||
<varname>hbase.regionserver.dns.interface</varname> to indicate the
|
||||
primary interface. This only works if your cluster configuration is
|
||||
consistent and every host has the same network interface
|
||||
configuration.</para>
|
||||
|
||||
<para>Another alternative is setting
|
||||
<varname>hbase.regionserver.dns.nameserver</varname> to choose a
|
||||
different nameserver than the system wide default.</para>
|
||||
</section>
|
||||
|
||||
<section xml:id="ntp">
|
||||
<title>NTP</title>
|
||||
|
||||
<para>The clocks on cluster members should be in basic alignment.
|
||||
Some skew is tolerable but wild skew could generate odd behaviors. Run
|
||||
<link
|
||||
xlink:href="http://en.wikipedia.org/wiki/Network_Time_Protocol">NTP</link>
|
||||
on your cluster, or an equivalent.</para>
|
||||
|
||||
<para>If you are having problems querying data, or "weird" cluster
|
||||
operations, check system time!</para>
|
||||
</section>
|
||||
|
||||
<section xml:id="ulimit">
|
||||
<title>
|
||||
<varname>ulimit</varname><indexterm>
|
||||
<primary>ulimit</primary>
|
||||
</indexterm>
|
||||
and
|
||||
<varname>nproc</varname><indexterm>
|
||||
<primary>nproc</primary>
|
||||
</indexterm>
|
||||
</title>
|
||||
|
||||
<para>HBase is a database. It uses a lot of files all at the same time.
|
||||
The default ulimit -n -- i.e. user file limit -- of 1024 on most *nix systems
|
||||
is insufficient (on Mac OS X it's 256). Any significant amount of loading will
|
||||
lead you to <link xlink:href="http://wiki.apache.org/hadoop/Hbase/FAQ#A6">FAQ: Why do I
|
||||
see "java.io.IOException...(Too many open files)" in my logs?</link>.
|
||||
You may also notice errors such as <programlisting>
|
||||
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
|
||||
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901
|
||||
</programlisting> Do yourself a favor and change the upper bound on the
|
||||
number of file descriptors. Set it to north of 10k. See the above
|
||||
referenced FAQ for how. You should also up the hbase users'
|
||||
<varname>nproc</varname> setting; under load, a low-nproc
|
||||
setting could manifest as <classname>OutOfMemoryError</classname>
|
||||
<footnote><para>See Jack Levin's <link xlink:href="">major hdfs issues</link>
|
||||
note up on the user list.</para></footnote>
|
||||
<footnote><para>The requirement that a database requires upping of system limits
|
||||
is not peculiar to HBase. See for example the section
|
||||
<emphasis>Setting Shell Limits for the Oracle User</emphasis> in
|
||||
<link xlink:href="http://www.akadia.com/services/ora_linux_install_10g.html">
|
||||
Short Guide to install Oracle 10 on Linux</link>.</para></footnote>.
|
||||
</para>
|
||||
|
||||
<para>To be clear, upping the file descriptors and nproc for the user who is
|
||||
running the HBase process is an operating system configuration, not an
|
||||
HBase configuration. Also, a common mistake is that administrators
|
||||
will up the file descriptors for a particular user but for whatever
|
||||
reason, HBase will be running as someone else. HBase prints in its
logs as the first line the ulimit it's seeing. Ensure it's correct.
|
||||
<footnote>
|
||||
<para>A useful read on setting config on your Hadoop cluster is Aaron
Kimball's <link
xlink:href="http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/">Configuration
|
||||
Parameters: What can you just ignore?</link></para>
|
||||
</footnote></para>
|
||||
|
||||
<section xml:id="ulimit_ubuntu">
|
||||
<title><varname>ulimit</varname> on Ubuntu</title>
|
||||
|
||||
<para>If you are on Ubuntu you will need to make the following
|
||||
changes:</para>
|
||||
|
||||
<para>In the file <filename>/etc/security/limits.conf</filename> add
|
||||
a line like: <programlisting>hadoop - nofile 32768</programlisting>
|
||||
Replace <varname>hadoop</varname> with whatever user is running
|
||||
Hadoop and HBase. If you have separate users, you will need 2
|
||||
entries, one for each user. In the same file set nproc hard and soft
|
||||
limits. For example: <programlisting>hadoop soft/hard nproc 32000</programlisting>.</para>
|
||||
|
||||
<para>In the file <filename>/etc/pam.d/common-session</filename> add
|
||||
as the last line in the file: <programlisting>session required pam_limits.so</programlisting>
|
||||
Otherwise the changes in <filename>/etc/security/limits.conf</filename> won't be
|
||||
applied.</para>
|
||||
|
||||
<para>Don't forget to log out and back in again for the changes to
|
||||
take effect!</para>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section xml:id="dfs.datanode.max.xcievers">
|
||||
<title><varname>dfs.datanode.max.xcievers</varname><indexterm>
|
||||
<primary>xcievers</primary>
|
||||
</indexterm></title>
|
||||
|
||||
<para>A Hadoop HDFS datanode has an upper bound on the number of
|
||||
files that it will serve at any one time. The upper bound parameter is
|
||||
called <varname>xcievers</varname> (yes, this is misspelled). Again,
|
||||
before doing any loading, make sure you have configured Hadoop's
|
||||
<filename>conf/hdfs-site.xml</filename> setting the
|
||||
<varname>xcievers</varname> value to at least the following:
|
||||
<programlisting>
|
||||
<property>
|
||||
<name>dfs.datanode.max.xcievers</name>
|
||||
<value>4096</value>
|
||||
</property>
|
||||
</programlisting></para>
|
||||
|
||||
<para>Be sure to restart your HDFS after making the above
|
||||
configuration.</para>
|
||||
|
||||
<para>Not having this configuration in place makes for strange looking
|
||||
failures. Eventually you'll see a complaint in the datanode logs
about the xcievers limit being exceeded, but in the run-up to this, one
manifestation is a complaint about missing blocks. For example:
|
||||
<code>10/12/08 20:10:31 INFO hdfs.DFSClient: Could not obtain block
|
||||
blk_XXXXXXXXXXXXXXXXXXXXXX_YYYYYYYY from any node:
|
||||
java.io.IOException: No live nodes contain current block. Will get new
|
||||
block locations from namenode and retry...</code>
|
||||
<footnote><para>See <link xlink:href="http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html">Hadoop HDFS: Deceived by Xciever</link> for an informative rant on xceivering.</para></footnote></para>
|
||||
</section>
|
||||
|
||||
<section xml:id="windows">
|
||||
<title>Windows</title>
|
||||
|
||||
<para>HBase has been little tested running on Windows. Running a
production install of HBase on top of Windows is not
|
||||
recommended.</para>
|
||||
|
||||
<para>If you are running HBase on Windows, you must install <link
|
||||
xlink:href="http://cygwin.com/">Cygwin</link> to have a *nix-like
|
||||
environment for the shell scripts. The full details are explained in
|
||||
the <link xlink:href="http://hbase.apache.org/cygwin.html">Windows
|
||||
Installation</link> guide. Also
|
||||
<link xlink:href="http://search-hadoop.com/?q=hbase+windows&fc_project=HBase&fc_type=mail+_hash_+dev">search our user mailing list</link> to pick
|
||||
up the latest fixes figured out by Windows users.</para>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section xml:id="standalone_dist">
|
||||
<title>HBase run modes: Standalone and Distributed</title>
|
||||
|
||||
<para>HBase has two run modes: <xref linkend="standalone" /> and <xref linkend="distributed" />. Out of the box, HBase runs in
|
||||
standalone mode. To set up a distributed deploy, you will need to
|
||||
configure HBase by editing files in the HBase <filename>conf</filename>
|
||||
directory.</para>
|
||||
|
||||
<para>Whatever your mode, you will need to edit
|
||||
<code>conf/hbase-env.sh</code> to tell HBase which
|
||||
<command>java</command> to use. In this file you set HBase environment
|
||||
variables such as the heapsize and other options for the
|
||||
<application>JVM</application>, the preferred location for log files,
|
||||
etc. Set <varname>JAVA_HOME</varname> to point at the root of your
|
||||
<command>java</command> install.</para>
|
||||
|
||||
<section xml:id="standalone">
|
||||
<title>Standalone HBase</title>
|
||||
|
||||
<para>This is the default mode. Standalone mode is what is described
|
||||
in the <xref linkend="quickstart" /> section. In
|
||||
standalone mode, HBase does not use HDFS -- it uses the local
|
||||
filesystem instead -- and it runs all HBase daemons and a local
|
||||
ZooKeeper all up in the same JVM. Zookeeper binds to a well known port
|
||||
so clients may talk to HBase.</para>
|
||||
</section>
|
||||
|
||||
<section xml:id="distributed">
|
||||
<title>Distributed</title>
|
||||
|
||||
<para>Distributed mode can be subdivided into distributed but all
|
||||
daemons run on a single node -- a.k.a
|
||||
<emphasis>pseudo-distributed</emphasis>-- and
|
||||
<emphasis>fully-distributed</emphasis> where the daemons are spread
|
||||
across all nodes in the cluster <footnote>
|
||||
<para>The pseudo-distributed vs fully-distributed nomenclature
|
||||
comes from Hadoop.</para>
|
||||
</footnote>.</para>
|
||||
|
||||
<para>Distributed modes require an instance of the <emphasis>Hadoop
|
||||
Distributed File System</emphasis> (HDFS). See the Hadoop <link
|
||||
xlink:href="http://hadoop.apache.org/common/docs/current/api/overview-summary.html#overview_description">
|
||||
requirements and instructions</link> for how to set up an HDFS. Before
|
||||
proceeding, ensure you have an appropriate, working HDFS.</para>
|
||||
|
||||
<para>Below we describe the different distributed setups. Starting,
|
||||
verification and exploration of your install, whether a
|
||||
<emphasis>pseudo-distributed</emphasis> or
|
||||
<emphasis>fully-distributed</emphasis> configuration is described in a
|
||||
section that follows, <xref linkend="confirm" />. The same verification script applies to both
|
||||
deploy types.</para>
|
||||
|
||||
<section xml:id="pseudo">
|
||||
<title>Pseudo-distributed</title>
|
||||
|
||||
<para>A pseudo-distributed mode is simply a distributed mode run on
|
||||
a single host. Use this configuration for testing and prototyping on
|
||||
HBase. Do not use this configuration for production nor for
|
||||
evaluating HBase performance.</para>
|
||||
|
||||
<para>Once you have confirmed your HDFS setup, edit
|
||||
<filename>conf/hbase-site.xml</filename>. This is the file into
|
||||
which you add local customizations and overrides for
|
||||
<xref linkend="hbase_default_configurations" /> and <xref linkend="hdfs_client_conf" />. Point HBase at the running Hadoop HDFS
|
||||
instance by setting the <varname>hbase.rootdir</varname> property.
|
||||
This property points HBase at the Hadoop filesystem instance to use.
|
||||
For example, adding the properties below to your
|
||||
<filename>hbase-site.xml</filename> says that HBase should use the
|
||||
<filename>/hbase</filename> directory in the HDFS whose namenode is
|
||||
at port 9000 on your local machine, and that it should run with one
|
||||
replica only (recommended for pseudo-distributed mode):</para>
|
||||
|
||||
<programlisting>
|
||||
<configuration>
|
||||
...
|
||||
<property>
|
||||
<name>hbase.rootdir</name>
|
||||
<value>hdfs://localhost:9000/hbase</value>
|
||||
<description>The directory shared by RegionServers.
|
||||
</description>
|
||||
</property>
|
||||
<property>
|
||||
<name>dfs.replication</name>
|
||||
<value>1</value>
|
||||
<description>The replication count for HLog and HFile storage. Should not be greater than HDFS datanode count.
|
||||
</description>
|
||||
</property>
|
||||
...
|
||||
</configuration>
|
||||
</programlisting>
|
||||
|
||||
<note>
|
||||
<para>Let HBase create the <varname>hbase.rootdir</varname>
|
||||
directory. If you don't, you'll get a warning saying HBase needs a
|
||||
migration run because the directory is missing files expected by
|
||||
HBase (it'll create them if you let it).</para>
|
||||
</note>
|
||||
|
||||
<note>
|
||||
<para>Above we bind to <varname>localhost</varname>. This means
|
||||
that a remote client cannot connect. Amend accordingly, if you
|
||||
want to connect from a remote location.</para>
|
||||
</note>
|
||||
|
||||
<para>Now skip to <xref linkend="confirm" /> for how to start and verify your
|
||||
pseudo-distributed install. <footnote>
|
||||
<para>See <link
|
||||
xlink:href="http://hbase.apache.org/pseudo-distributed.html">Pseudo-distributed
|
||||
mode extras</link> for notes on how to start extra Masters and
|
||||
RegionServers when running pseudo-distributed.</para>
|
||||
</footnote></para>
|
||||
</section>
|
||||
|
||||
<section xml:id="fully_dist">
|
||||
<title>Fully-distributed</title>
|
||||
|
||||
<para>For running a fully-distributed operation on more than one
|
||||
host, make the following configurations. In
|
||||
<filename>hbase-site.xml</filename>, add the property
|
||||
<varname>hbase.cluster.distributed</varname> and set it to
|
||||
<varname>true</varname> and point the HBase
|
||||
<varname>hbase.rootdir</varname> at the appropriate HDFS NameNode
|
||||
and location in HDFS where you would like HBase to write data. For
|
||||
example, if your namenode were running at namenode.example.org on
|
||||
port 9000 and you wanted to home your HBase in HDFS at
|
||||
<filename>/hbase</filename>, make the following
|
||||
configuration.</para>
|
||||
|
||||
<programlisting>
|
||||
<configuration>
|
||||
...
|
||||
<property>
|
||||
<name>hbase.rootdir</name>
|
||||
<value>hdfs://namenode.example.org:9000/hbase</value>
|
||||
<description>The directory shared by RegionServers.
|
||||
</description>
|
||||
</property>
|
||||
<property>
|
||||
<name>hbase.cluster.distributed</name>
|
||||
<value>true</value>
|
||||
<description>The mode the cluster will be in. Possible values are
|
||||
false: standalone and pseudo-distributed setups with managed Zookeeper
|
||||
true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
|
||||
</description>
|
||||
</property>
|
||||
...
|
||||
</configuration>
|
||||
</programlisting>
|
||||
|
||||
<section xml:id="regionserver">
|
||||
<title><filename>regionservers</filename></title>
|
||||
|
||||
<para>In addition, a fully-distributed mode requires that you
|
||||
modify <filename>conf/regionservers</filename>. The
|
||||
<xref linkend="regionservers" /> file
|
||||
lists all hosts that you would have running
|
||||
<application>HRegionServer</application>s, one host per line (This
|
||||
file in HBase is like the Hadoop <filename>slaves</filename>
|
||||
file). All servers listed in this file will be started and stopped
|
||||
when the HBase cluster start or stop script is run.</para>
|
||||
</section>
|
||||
|
||||
<section xml:id="zookeeper">
|
||||
<title>ZooKeeper<indexterm>
|
||||
<primary>ZooKeeper</primary>
|
||||
</indexterm></title>
|
||||
|
||||
<para>A distributed HBase depends on a running ZooKeeper cluster.
|
||||
All participating nodes and clients need to be able to access the
|
||||
running ZooKeeper ensemble. HBase by default manages a ZooKeeper
|
||||
"cluster" for you. It will start and stop the ZooKeeper ensemble
|
||||
as part of the HBase start/stop process. You can also manage the
|
||||
ZooKeeper ensemble independent of HBase and just point HBase at
|
||||
the cluster it should use. To toggle HBase management of
|
||||
ZooKeeper, use the <varname>HBASE_MANAGES_ZK</varname> variable in
|
||||
<filename>conf/hbase-env.sh</filename>. This variable, which
|
||||
defaults to <varname>true</varname>, tells HBase whether to
|
||||
start/stop the ZooKeeper ensemble servers as part of HBase
|
||||
start/stop.</para>
|
||||
|
||||
<para>When HBase manages the ZooKeeper ensemble, you can specify
|
||||
ZooKeeper configuration using its native
|
||||
<filename>zoo.cfg</filename> file, or, the easier option is to
|
||||
just specify ZooKeeper options directly in
|
||||
<filename>conf/hbase-site.xml</filename>. A ZooKeeper
|
||||
configuration option can be set as a property in the HBase
|
||||
<filename>hbase-site.xml</filename> XML configuration file by
|
||||
prefacing the ZooKeeper option name with
|
||||
<varname>hbase.zookeeper.property</varname>. For example, the
|
||||
<varname>clientPort</varname> setting in ZooKeeper can be changed
|
||||
by setting the
|
||||
<varname>hbase.zookeeper.property.clientPort</varname> property.
|
||||
For all default values used by HBase, including ZooKeeper
|
||||
configuration, see <xref linkend="hbase_default_configurations" />. Look for the
|
||||
<varname>hbase.zookeeper.property</varname> prefix <footnote>
|
||||
<para>For the full list of ZooKeeper configurations, see
|
||||
ZooKeeper's <filename>zoo.cfg</filename>. HBase does not ship
|
||||
with a <filename>zoo.cfg</filename> so you will need to browse
|
||||
the <filename>conf</filename> directory in an appropriate
|
||||
ZooKeeper download.</para>
|
||||
</footnote></para>
|
||||
|
||||
<para>You must at least list the ensemble servers in
|
||||
<filename>hbase-site.xml</filename> using the
|
||||
<varname>hbase.zookeeper.quorum</varname> property. This property
|
||||
defaults to a single ensemble member at
|
||||
<varname>localhost</varname> which is not suitable for a fully
|
||||
distributed HBase. (It binds to the local machine only and remote
|
||||
clients will not be able to connect). <note xml:id="how_many_zks">
|
||||
<title>How many ZooKeepers should I run?</title>
|
||||
|
||||
<para>You can run a ZooKeeper ensemble that comprises 1 node
|
||||
only but in production it is recommended that you run a
|
||||
ZooKeeper ensemble of 3, 5 or 7 machines; the more members an
|
||||
ensemble has, the more tolerant the ensemble is of host
|
||||
failures. Also, run an odd number of machines. An even-sized ensemble
tolerates no more failures than the next smaller odd-sized one, because a
quorum requires a strict majority of members. Give each
|
||||
ZooKeeper server around 1GB of RAM, and if possible, its own
|
||||
dedicated disk (A dedicated disk is the best thing you can do
|
||||
to ensure a performant ZooKeeper ensemble). For very heavily
|
||||
loaded clusters, run ZooKeeper servers on separate machines
|
||||
from RegionServers (DataNodes and TaskTrackers).</para>
|
||||
</note></para>
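<para>The arithmetic behind the odd-number recommendation can be sketched in a few lines of shell. This is an illustration only, assuming the usual majority rule (a quorum needs more than half of the members); the function name <varname>zk_tolerated_failures</varname> is made up for this example and is not part of HBase or ZooKeeper.</para>

```shell
#!/bin/sh
# Number of member failures an ensemble of size $1 can survive while a
# strict majority (quorum) of members is still alive.
# Hypothetical helper for illustration; not shipped with HBase or ZooKeeper.
zk_tolerated_failures() {
  echo $(( ($1 - 1) / 2 ))
}

# Compare ensemble sizes side by side.
for n in 1 2 3 4 5 6 7; do
  echo "ensemble=$n tolerated_failures=$(zk_tolerated_failures "$n")"
done
```

<para>Under that rule a 4-node ensemble survives one failure, the same as a 3-node ensemble, which is why the extra even member buys nothing.</para>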
|
||||
|
||||
<para>For example, to have HBase manage a ZooKeeper quorum on
|
||||
nodes <emphasis>rs{1,2,3,4,5}.example.com</emphasis>, bound to
|
||||
port 2222 (the default is 2181) ensure
|
||||
<varname>HBASE_MANAGES_ZK</varname> is commented out or set to
|
||||
<varname>true</varname> in <filename>conf/hbase-env.sh</filename>
|
||||
and then edit <filename>conf/hbase-site.xml</filename> and set
|
||||
<varname>hbase.zookeeper.property.clientPort</varname> and
|
||||
<varname>hbase.zookeeper.quorum</varname>. You should also set
|
||||
<varname>hbase.zookeeper.property.dataDir</varname> to other than
|
||||
the default as the default has ZooKeeper persist data under
|
||||
<filename>/tmp</filename> which is often cleared on system
|
||||
restart. In the example below we have ZooKeeper persist to
|
||||
<filename>/usr/local/zookeeper</filename>. <programlisting>
|
||||
<configuration>
|
||||
...
|
||||
<property>
|
||||
<name>hbase.zookeeper.property.clientPort</name>
|
||||
<value>2222</value>
|
||||
<description>Property from ZooKeeper's config zoo.cfg.
|
||||
The port at which the clients will connect.
|
||||
</description>
|
||||
</property>
|
||||
<property>
|
||||
<name>hbase.zookeeper.quorum</name>
|
||||
<value>rs1.example.com,rs2.example.com,rs3.example.com,rs4.example.com,rs5.example.com</value>
|
||||
<description>Comma separated list of servers in the ZooKeeper Quorum.
|
||||
For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
|
||||
By default this is set to localhost for local and pseudo-distributed modes
|
||||
of operation. For a fully-distributed setup, this should be set to a full
|
||||
list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
|
||||
this is the list of servers which we will start/stop ZooKeeper on.
|
||||
</description>
|
||||
</property>
|
||||
<property>
|
||||
<name>hbase.zookeeper.property.dataDir</name>
|
||||
<value>/usr/local/zookeeper</value>
|
||||
<description>Property from ZooKeeper's config zoo.cfg.
|
||||
The directory where the snapshot is stored.
|
||||
</description>
|
||||
</property>
|
||||
...
|
||||
</configuration></programlisting></para>
|
||||
|
||||
<section>
|
||||
<title>Using existing ZooKeeper ensemble</title>
|
||||
|
||||
<para>To point HBase at an existing ZooKeeper cluster, one that
|
||||
is not managed by HBase, set <varname>HBASE_MANAGES_ZK</varname>
|
||||
in <filename>conf/hbase-env.sh</filename> to false
|
||||
<programlisting>
|
||||
...
|
||||
# Tell HBase whether it should manage its own instance of Zookeeper or not.
|
||||
export HBASE_MANAGES_ZK=false</programlisting> Next set ensemble locations
|
||||
and client port, if non-standard, in
|
||||
<filename>hbase-site.xml</filename>, or add a suitably
|
||||
configured <filename>zoo.cfg</filename> to HBase's
|
||||
<filename>CLASSPATH</filename>. HBase will prefer the
|
||||
configuration found in <filename>zoo.cfg</filename> over any
|
||||
settings in <filename>hbase-site.xml</filename>.</para>
|
||||
|
||||
<para>When HBase manages ZooKeeper, it will start/stop the
|
||||
ZooKeeper servers as a part of the regular start/stop scripts.
|
||||
If you would like to run ZooKeeper yourself, independent of
|
||||
HBase start/stop, you would do the following</para>
|
||||
|
||||
<programlisting>
|
||||
${HBASE_HOME}/bin/hbase-daemons.sh {start,stop} zookeeper
|
||||
</programlisting>
|
||||
|
||||
<para>Note that you can use HBase in this manner to spin up a
|
||||
ZooKeeper cluster, unrelated to HBase. Just make sure to set
|
||||
<varname>HBASE_MANAGES_ZK</varname> to <varname>false</varname>
|
||||
if you want it to stay up across HBase restarts so that when
|
||||
HBase shuts down, it doesn't take ZooKeeper down with it.</para>
|
||||
|
||||
<para>For more information about running a distinct ZooKeeper
|
||||
cluster, see the ZooKeeper <link
|
||||
xlink:href="http://hadoop.apache.org/zookeeper/docs/current/zookeeperStarted.html">Getting
|
||||
Started Guide</link>.</para>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section xml:id="hdfs_client_conf">
|
||||
<title>HDFS Client Configuration</title>
|
||||
|
||||
<para>Of note, if you have made <emphasis>HDFS client
|
||||
configuration</emphasis> on your Hadoop cluster -- i.e.
|
||||
configuration you want HDFS clients to use as opposed to
|
||||
server-side configurations -- HBase will not see this
|
||||
configuration unless you do one of the following:</para>
|
||||
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
<para>Add a pointer to your <varname>HADOOP_CONF_DIR</varname>
|
||||
to the <varname>HBASE_CLASSPATH</varname> environment variable
|
||||
in <filename>hbase-env.sh</filename>.</para>
|
||||
</listitem>
|
||||
|
||||
<listitem>
|
||||
<para>Add a copy of <filename>hdfs-site.xml</filename> (or
|
||||
<filename>hadoop-site.xml</filename>) or, better, symlinks,
|
||||
under <filename>${HBASE_HOME}/conf</filename>, or</para>
|
||||
</listitem>
|
||||
|
||||
<listitem>
|
||||
<para>if only a small set of HDFS client configurations, add
|
||||
them to <filename>hbase-site.xml</filename>.</para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
|
||||
<para>An example of such an HDFS client configuration is
|
||||
<varname>dfs.replication</varname>. If for example, you want to
|
||||
run with a replication factor of 5, HBase will create files with
|
||||
the default of 3 unless you do the above to make the configuration
|
||||
available to HBase.</para>
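<para>As a rough sketch of the symlink option from the list above, the helper below links <filename>hdfs-site.xml</filename> from a Hadoop conf directory into an HBase conf directory. The function name <varname>link_hdfs_client_conf</varname> and its two arguments are assumptions made for this example; adjust the paths to your own layout.</para>

```shell
#!/bin/sh
# Symlink hdfs-site.xml from a Hadoop conf directory into an HBase conf
# directory so that HBase sees the HDFS client settings.
# Hypothetical helper for illustration; both directories are arguments.
link_hdfs_client_conf() {
  hadoop_conf=$1   # e.g. "$HADOOP_CONF_DIR"
  hbase_conf=$2    # e.g. "${HBASE_HOME}/conf"
  if [ ! -f "${hadoop_conf}/hdfs-site.xml" ]; then
    echo "no hdfs-site.xml in ${hadoop_conf}" >&2
    return 1
  fi
  # -f replaces a stale link left over from an earlier run.
  ln -sf "${hadoop_conf}/hdfs-site.xml" "${hbase_conf}/hdfs-site.xml"
}

# Usage (not run here):
# link_hdfs_client_conf "$HADOOP_CONF_DIR" "${HBASE_HOME}/conf"
```

<para>A symlink rather than a copy means later edits to the Hadoop file are picked up without a second manual step. The link still has to exist on every node that runs HBase.</para>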
|
||||
</section>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section xml:id="confirm">
|
||||
<title>Running and Confirming Your Installation</title>
|
||||
|
||||
|
||||
|
||||
<para>Make sure HDFS is running first. Start the Hadoop HDFS
daemons by running <filename>bin/start-dfs.sh</filename> over in the
|
||||
<varname>HADOOP_HOME</varname> directory. You can ensure it started
|
||||
properly by testing the <command>put</command> and
|
||||
<command>get</command> of files into the Hadoop filesystem. HBase does
|
||||
not normally use the mapreduce daemons. These do not need to be
|
||||
started.</para>
|
||||
|
||||
|
||||
|
||||
<para><emphasis>If</emphasis> you are managing your own ZooKeeper,
|
||||
start it and confirm it's running; otherwise, HBase will start up ZooKeeper
|
||||
for you as part of its start process.</para>
|
||||
|
||||
|
||||
|
||||
<para>Start HBase with the following command:</para>
|
||||
|
||||
|
||||
|
||||
<programlisting>bin/start-hbase.sh</programlisting>
|
||||
|
||||
<para>Run the above from the <varname>HBASE_HOME</varname> directory.</para>
|
||||
|
||||
<para>You should now have a running HBase instance. HBase logs can be
|
||||
found in the <filename>logs</filename> subdirectory. Check them out
|
||||
especially if HBase had trouble starting.</para>
|
||||
|
||||
|
||||
|
||||
<para>HBase also puts up a UI listing vital attributes. By default it's
|
||||
deployed on the Master host at port 60010 (HBase RegionServers listen
|
||||
on port 60020 by default and put up an informational http server at
|
||||
60030). If the Master were running on a host named
|
||||
<varname>master.example.org</varname> on the default port, to see the
|
||||
Master's homepage you'd point your browser at
|
||||
<filename>http://master.example.org:60010</filename>.</para>
|
||||
|
||||
|
||||
|
||||
<para>Once HBase has started, see the <xref linkend="shell_exercises" /> for how to
|
||||
create tables, add data, scan your insertions, and finally disable and
|
||||
drop your tables.</para>
|
||||
|
||||
|
||||
|
||||
<para>To stop HBase after exiting the HBase shell enter
|
||||
<programlisting>$ ./bin/stop-hbase.sh
|
||||
stopping hbase...............</programlisting> Shutdown can take a moment to
|
||||
complete. It can take longer if your cluster comprises many
|
||||
machines. If you are running a distributed operation, be sure to wait
|
||||
until HBase has shut down completely before stopping the Hadoop
|
||||
daemons.</para>
|
||||
|
||||
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section xml:id="example_config">
|
||||
<title>Example Configurations</title>
|
||||
|
||||
<section>
|
||||
<title>Basic Distributed HBase Install</title>
|
||||
|
||||
<para>Here is an example basic configuration for a distributed ten
|
||||
node cluster. The nodes are named <varname>example0</varname>,
|
||||
<varname>example1</varname>, etc., through node
|
||||
<varname>example9</varname> in this example. The HBase Master and the
|
||||
HDFS namenode are running on the node <varname>example0</varname>.
|
||||
RegionServers run on nodes
|
||||
<varname>example1</varname>-<varname>example9</varname>. A 3-node
|
||||
ZooKeeper ensemble runs on <varname>example1</varname>,
|
||||
<varname>example2</varname>, and <varname>example3</varname> on the
|
||||
default ports. ZooKeeper data is persisted to the directory
|
||||
<filename>/export/zookeeper</filename>. Below we show what the main
|
||||
configuration files -- <filename>hbase-site.xml</filename>,
|
||||
<filename>regionservers</filename>, and
|
||||
<filename>hbase-env.sh</filename> -- found in the HBase
|
||||
<filename>conf</filename> directory might look like.</para>
|
||||
|
||||
<section xml:id="hbase_site">
|
||||
<title><filename>hbase-site.xml</filename></title>
|
||||
|
||||
<programlisting>
|
||||
|
||||
<?xml version="1.0"?>
|
||||
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
|
||||
<configuration>
|
||||
<property>
|
||||
<name>hbase.zookeeper.quorum</name>
|
||||
<value>example1,example2,example3</value>
|
||||
<description>Comma separated list of servers in the ZooKeeper Quorum.
|
||||
</description>
|
||||
</property>
|
||||
<property>
|
||||
<name>hbase.zookeeper.property.dataDir</name>
|
||||
<value>/export/zookeeper</value>
|
||||
<description>Property from ZooKeeper's config zoo.cfg.
|
||||
The directory where the snapshot is stored.
|
||||
</description>
|
||||
</property>
|
||||
<property>
|
||||
<name>hbase.rootdir</name>
|
||||
<value>hdfs://example0:9000/hbase</value>
|
||||
<description>The directory shared by RegionServers.
|
||||
</description>
|
||||
</property>
|
||||
<property>
|
||||
<name>hbase.cluster.distributed</name>
|
||||
<value>true</value>
|
||||
<description>The mode the cluster will be in. Possible values are
|
||||
false: standalone and pseudo-distributed setups with managed Zookeeper
|
||||
true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
|
||||
</description>
|
||||
</property>
|
||||
</configuration>
|
||||
|
||||
</programlisting>
|
||||
</section>
|
||||
|
||||
<section xml:id="regionservers">
|
||||
<title><filename>regionservers</filename></title>
|
||||
|
||||
<para>In this file you list the nodes that will run RegionServers.
|
||||
In our case we run RegionServers on all but the head node
|
||||
<varname>example0</varname> which is carrying the HBase Master and
the HDFS namenode.</para>
|
||||
|
||||
<programlisting>
|
||||
example1
example2
|
||||
example3
|
||||
example4
|
||||
example5
|
||||
example6
|
||||
example7
|
||||
example8
|
||||
example9
|
||||
</programlisting>
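<para>For a cluster whose hosts follow a numbered naming scheme, a file like this can be generated instead of typed by hand. The sketch below assumes the <varname>example</varname> host prefix used in this section; the function name <varname>gen_regionservers</varname> is made up for the example.</para>

```shell
#!/bin/sh
# Print one hostname per line for example<first> .. example<last>.
# Hypothetical helper; the "example" prefix matches this section's hosts.
gen_regionservers() {
  first=$1
  last=$2
  i=$first
  while [ "$i" -le "$last" ]; do
    echo "example$i"
    i=$((i + 1))
  done
}

# Print the list for hosts example1 through example9.
gen_regionservers 1 9

# To write the file instead (not run here):
# gen_regionservers 1 9 > "${HBASE_HOME}/conf/regionservers"
```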
|
||||
</section>
|
||||
|
||||
<section xml:id="hbase_env">
|
||||
<title><filename>hbase-env.sh</filename></title>
|
||||
|
||||
<para>Below we use a <command>diff</command> to show the differences
|
||||
from default in the <filename>hbase-env.sh</filename> file. Here we
|
||||
are setting the HBase heap to be 4G instead of the default
|
||||
1G.</para>
|
||||
|
||||
<programlisting>
|
||||
|
||||
$ git diff hbase-env.sh
|
||||
diff --git a/conf/hbase-env.sh b/conf/hbase-env.sh
|
||||
index e70ebc6..96f8c27 100644
|
||||
--- a/conf/hbase-env.sh
|
||||
+++ b/conf/hbase-env.sh
|
||||
@@ -31,7 +31,7 @@ export JAVA_HOME=/usr/lib//jvm/java-6-sun/
|
||||
# export HBASE_CLASSPATH=
|
||||
|
||||
# The maximum amount of heap to use, in MB. Default is 1000.
|
||||
-# export HBASE_HEAPSIZE=1000
|
||||
+export HBASE_HEAPSIZE=4096
|
||||
|
||||
# Extra Java runtime options.
|
||||
# Below are what we set by default. May only work with SUN JVM.
|
||||
|
||||
</programlisting>
|
||||
|
||||
<para>Use <command>rsync</command> to copy the content of the
|
||||
<filename>conf</filename> directory to all nodes of the
|
||||
cluster.</para>
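<para>One way to script the copy is sketched below. It only prints the <command>rsync</command> commands, one per host listed in a <filename>regionservers</filename>-style file, so nothing is copied until you choose to run them. The function name <varname>print_conf_sync</varname> is made up for this example, and the sketch assumes the conf directory lives at the same path on every node and that you can reach each host over ssh.</para>

```shell
#!/bin/sh
# Print an rsync command for each host in a regionservers-style file
# (one hostname per line). Nothing is copied by this function.
# Hypothetical helper; assumes the same conf path on every node.
print_conf_sync() {
  conf_dir=$1     # local conf directory, e.g. ${HBASE_HOME}/conf
  hosts_file=$2   # file with one hostname per line
  while IFS= read -r host; do
    # Skip blank lines and comment lines.
    case "$host" in ''|'#'*) continue ;; esac
    # -a preserves permissions and times, -z compresses in transit.
    echo "rsync -az ${conf_dir}/ ${host}:${conf_dir}/"
  done < "$hosts_file"
}

# Usage (not run here):
# print_conf_sync "${HBASE_HOME}/conf" "${HBASE_HOME}/conf/regionservers"
```

<para>Review the printed commands first, then pipe the output to <command>sh</command> to run them. Hosts that are not in the file, such as a Master on a separate machine, need the same copy.</para>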
|
||||
</section>
|
||||
</section>
|
||||
</section>
|
||||
</section>
|
||||
</chapter>
|
||||
|
|
|
@ -9,7 +9,7 @@
|
|||
xmlns:db="http://docbook.org/ns/docbook">
|
||||
<title>Upgrading</title>
|
||||
<para>
|
||||
Review <xref linkend="requirements" />, in particular the section on Hadoop version.
|
||||
Review <xref linkend="configuration" />, in particular the section on Hadoop version.
|
||||
</para>
|
||||
<section xml:id="upgrade0.90">
|
||||
<title>Upgrading to HBase 0.90.x from 0.20.x or 0.89.x</title>
|
||||
|
|