HBASE-11399 Improve Quickstart chapter and move Pseudo-distributed and distributed into it (Misty Stanley-Jones)

Jonathan M Hsieh 2014-07-02 11:27:18 -07:00
parent 8b1c29b8be
commit 4d141a2436
2 changed files with 1007 additions and 524 deletions


@ -29,228 +29,319 @@
*/
-->
<title>Apache HBase Configuration</title>
<para>This chapter is the Not-So-Quick start guide to Apache HBase configuration. It goes over
system requirements, Hadoop setup, the different Apache HBase run modes, and the various
configurations in HBase. Please read this chapter carefully. At a minimum ensure that all <xref
linkend="basic.prerequisites" /> have been satisfied. Failure to do so will cause you (and us)
grief debugging strange errors and/or data loss.</para>
<para>This chapter expands upon the <xref linkend="getting_started" /> chapter to further explain
configuration of Apache HBase. Please read this chapter carefully, especially <xref
linkend="basic.prerequisites" /> to ensure that your HBase testing and deployment goes
smoothly, and prevent data loss.</para>
<para> Apache HBase uses the same configuration system as Apache Hadoop. To configure a deploy,
edit a file of environment variables in <filename>conf/hbase-env.sh</filename> -- this
configuration is used mostly by the launcher shell scripts getting the cluster off the ground --
and then add configuration to an XML file to do things like override HBase defaults, tell HBase
what Filesystem to use, and the location of the ZooKeeper ensemble. <footnote>
<para> Be careful editing XML. Make sure you close all elements. Run your file through
<command>xmllint</command> or similar to ensure well-formedness of your document after an
edit session. </para>
</footnote></para>
<para> Apache HBase uses the same configuration system as Apache Hadoop. All configuration files
are located in the <filename>conf/</filename> directory, which needs to be kept in sync for each
node on your cluster.</para>
<variablelist>
<title>HBase Configuration Files</title>
<varlistentry>
<term><filename>backup-masters</filename></term>
<listitem>
<para>Not present by default. A plain-text file which lists hosts on which the Master should
start a backup Master process, one host per line.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><filename>hadoop-metrics2-hbase.properties</filename></term>
<listitem>
<para>Used to connect HBase to Hadoop's Metrics2 framework. See the <link
xlink:href="http://wiki.apache.org/hadoop/HADOOP-6728-MetricsV2">Hadoop Wiki
entry</link> for more information on Metrics2. Contains only commented-out examples by
default.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><filename>hbase-env.cmd</filename> and <filename>hbase-env.sh</filename></term>
<listitem>
<para>Script for Windows and Linux / Unix environments to set up the working environment for
HBase, including the location of Java, Java options, and other environment variables. The
file contains many commented-out examples to provide guidance.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><filename>hbase-policy.xml</filename></term>
<listitem>
<para>The default policy configuration file used by RPC servers to make authorization
decisions on client requests. Only used if HBase security (<xref
linkend="security" />) is enabled.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><filename>hbase-site.xml</filename></term>
<listitem>
<para>The main HBase configuration file. This file specifies configuration options which
override HBase's default configuration. You can view (but do not edit) the default
configuration file at <filename>docs/hbase-default.xml</filename>. You can also view the
entire effective configuration for your cluster (defaults and overrides) in the
<guilabel>HBase Configuration</guilabel> tab of the HBase Web UI.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><filename>log4j.properties</filename></term>
<listitem>
<para>Configuration file for HBase logging via <code>log4j</code>.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><filename>regionservers</filename></term>
<listitem>
<para>A plain-text file containing a list of hosts which should run a RegionServer in your
HBase cluster. By default this file contains the single entry
<literal>localhost</literal>. It should contain a list of hostnames or IP addresses, one
per line, and should only contain <literal>localhost</literal> if each node in your
cluster will run a RegionServer on its <literal>localhost</literal> interface.</para>
</listitem>
</varlistentry>
</variablelist>
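<para>As a minimal sketch of what an override in <filename>conf/hbase-site.xml</filename>
looks like, the following fragment sets the filesystem location and the ZooKeeper ensemble.
The host names <literal>namenode.example.org</literal> and
<literal>zk1.example.org</literal> through <literal>zk3.example.org</literal> are
placeholders chosen for illustration, not values from a real cluster. Any property not
listed in the file keeps its default value.</para>
<programlisting><![CDATA[
<configuration>
  <!-- Hypothetical NameNode host and port; substitute your own HDFS location. -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.example.org:8020/hbase</value>
  </property>
  <!-- Hypothetical ZooKeeper hosts, given as a comma-separated list. -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.org,zk2.example.org,zk3.example.org</value>
  </property>
</configuration>
]]></programlisting>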
<tip>
<title>Checking XML Validity</title>
<para>When you edit XML, it is a good idea to use an XML-aware editor to be sure that your
syntax is correct and your XML is well-formed. You can also use the <command>xmllint</command>
utility to check that your XML is well-formed. By default, <command>xmllint</command> re-flows
and prints the XML to standard output. To check for well-formedness and only print output if
errors exist, use the command <command>xmllint -noout
<replaceable>filename.xml</replaceable></command>.</para>
</tip>
<para>When running in distributed mode, after you make an edit to an HBase configuration, make
sure you copy the content of the <filename>conf</filename> directory to all nodes of the
cluster. HBase will not do this for you. Use <command>rsync</command>. For most configuration, a
restart is needed for servers to pick up changes (the exception is dynamic configuration,
which is described below).</para>
<warning>
<title>Keep Configuration In Sync Across the Cluster</title>
<para>When running in distributed mode, after you make an edit to an HBase configuration, make
sure you copy the content of the <filename>conf/</filename> directory to all nodes of the
cluster. HBase will not do this for you. Use <command>rsync</command>, <command>scp</command>,
or another secure mechanism for copying the configuration files to your nodes. For most
configuration, a restart is needed for servers to pick up changes. An exception is dynamic
configuration, which is described below.</para>
</warning>
<section
xml:id="basic.prerequisites">
<title>Basic Prerequisites</title>
<para>This section lists required services and some required system configuration. </para>
<section
<table
xml:id="java">
<title>Java</title>
<para>HBase requires at least Java 6 from <link
xlink:href="http://www.java.com/download/">Oracle</link>. The following table lists which JDK versions are
compatible with each version of HBase.</para>
<informaltable>
<tgroup cols="4">
<thead>
<row>
<entry>HBase Version</entry>
<entry>JDK 6</entry>
<entry>JDK 7</entry>
<entry>JDK 8</entry>
</row>
</thead>
<tbody>
<row>
<entry>1.0</entry>
<entry><link xlink:href="http://search-hadoop.com/m/DHED4Zlz0R1">Not Supported</link></entry>
<entry>yes</entry>
<entry><para>Running with JDK 8 will work but is not well tested.</para></entry>
</row>
<row>
<entry>0.98</entry>
<entry>yes</entry>
<entry>yes</entry>
<entry><para>Running with JDK 8 works but is not well tested. Building with JDK 8
would require removal of the deprecated remove() method of the PoolMap class and is
under consideration. See <link
xlink:href="https://issues.apache.org/jira/browse/HBASE-7608">HBASE-7608</link> for
more information about JDK 8 support.</para></entry>
</row>
<row>
<entry>0.96</entry>
<entry>yes</entry>
<entry>yes</entry>
<entry></entry>
</row>
<row>
<entry>0.94</entry>
<entry>yes</entry>
<entry>yes</entry>
<entry></entry>
</row>
</tbody>
</tgroup>
</informaltable>
</section>
<textobject>
<para>HBase requires at least Java 6 from <link
xlink:href="http://www.java.com/download/">Oracle</link>. The following table lists
which JDK versions are compatible with each version of HBase.</para>
</textobject>
<tgroup
cols="4">
<thead>
<row>
<entry>HBase Version</entry>
<entry>JDK 6</entry>
<entry>JDK 7</entry>
<entry>JDK 8</entry>
</row>
</thead>
<tbody>
<row>
<entry>1.0</entry>
<entry><link
xlink:href="http://search-hadoop.com/m/DHED4Zlz0R1">Not Supported</link></entry>
<entry>yes</entry>
<entry><para>Running with JDK 8 will work but is not well tested.</para></entry>
</row>
<row>
<entry>0.98</entry>
<entry>yes</entry>
<entry>yes</entry>
<entry><para>Running with JDK 8 works but is not well tested. Building with JDK 8 would
require removal of the deprecated remove() method of the PoolMap class and is under
consideration. See <link
xlink:href="https://issues.apache.org/jira/browse/HBASE-7608">HBASE-7608</link>
for more information about JDK 8 support.</para></entry>
</row>
<row>
<entry>0.96</entry>
<entry>yes</entry>
<entry>yes</entry>
<entry />
</row>
<row>
<entry>0.94</entry>
<entry>yes</entry>
<entry>yes</entry>
<entry />
</row>
</tbody>
</tgroup>
</table>
<section
<variablelist
xml:id="os">
<title>Operating System</title>
<section
<title>Operating System Utilities</title>
<varlistentry
xml:id="ssh">
<title>ssh</title>
<para><command>ssh</command> must be installed and <command>sshd</command> must be running
to use Hadoop's scripts to manage remote Hadoop and HBase daemons. You must be able to ssh
to all nodes, including your local node, using passwordless login (search the web for "ssh
passwordless login"). If you are on Mac OS X, see the section <link
xlink:href="http://wiki.apache.org/hadoop/Running_Hadoop_On_OS_X_10.5_64-bit_%28Single-Node_Cluster%29">SSH:
Setting up Remote Desktop and Enabling Self-Login</link> on the Hadoop wiki.</para>
</section>
<section
<term>ssh</term>
<listitem>
<para>HBase uses the Secure Shell (ssh) command and utilities extensively to communicate
between cluster nodes. Each server in the cluster must be running <command>ssh</command>
so that the Hadoop and HBase daemons can be managed. You must be able to connect to all
nodes via SSH, including the local node, from the Master as well as any backup Master,
using a shared key rather than a password. You can see the basic methodology for such a
set-up in Linux or Unix systems at <xref
linkend="passwordless.ssh.quickstart" />. If your cluster nodes use OS X, see the
section, <link
xlink:href="http://wiki.apache.org/hadoop/Running_Hadoop_On_OS_X_10.5_64-bit_%28Single-Node_Cluster%29">SSH:
Setting up Remote Desktop and Enabling Self-Login</link> on the Hadoop wiki.</para>
</listitem>
</varlistentry>
<varlistentry
xml:id="dns">
<title>DNS</title>
<term>DNS</term>
<listitem>
<para>HBase uses the local hostname to self-report its IP address. Both forward and
reverse DNS resolving must work in versions of HBase previous to 0.92.0.<footnote>
<para>The <link
xlink:href="https://github.com/sujee/hadoop-dns-checker">hadoop-dns-checker</link>
tool can be used to verify DNS is working correctly on the cluster. The project
README file provides detailed instructions on usage. </para>
</footnote></para>
<para>HBase uses the local hostname to self-report its IP address. Both forward and reverse
DNS resolving must work in versions of HBase previous to 0.92.0 <footnote>
<para>The <link
xlink:href="https://github.com/sujee/hadoop-dns-checker">hadoop-dns-checker</link>
tool can be used to verify DNS is working correctly on the cluster. The project README
file provides detailed instructions on usage. </para>
</footnote>.</para>
<para>If your server has multiple network interfaces, HBase defaults to using the
interface that the primary hostname resolves to. To override this behavior, set the
<code>hbase.regionserver.dns.interface</code> property to a different interface. This
will only work if each server in your cluster uses the same network interface
configuration.</para>
<para>If your machine has multiple interfaces, HBase will use the interface that the primary
hostname resolves to.</para>
<para>If this is insufficient, you can set
<varname>hbase.regionserver.dns.interface</varname> to indicate the primary interface.
This only works if your cluster configuration is consistent and every host has the same
network interface configuration.</para>
<para>Another alternative is setting <varname>hbase.regionserver.dns.nameserver</varname> to
choose a different nameserver than the system wide default.</para>
</section>
<section
<para>To choose a different DNS nameserver than the system default, set the
<varname>hbase.regionserver.dns.nameserver</varname> property to the IP address of
that nameserver.</para>
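<para>The two DNS-related properties above might be set in
<filename>conf/hbase-site.xml</filename> roughly as follows. This is only a sketch: the
interface name <literal>eth1</literal> and the nameserver address
<literal>192.0.2.53</literal> are assumed values for illustration and must be replaced
with values that match your own network.</para>
<programlisting><![CDATA[
<configuration>
  <!-- Assumed interface name; it must exist under the same name on every server. -->
  <property>
    <name>hbase.regionserver.dns.interface</name>
    <value>eth1</value>
  </property>
  <!-- Assumed nameserver address, taken from a range reserved for documentation. -->
  <property>
    <name>hbase.regionserver.dns.nameserver</name>
    <value>192.0.2.53</value>
  </property>
</configuration>
]]></programlisting>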
</listitem>
</varlistentry>
<varlistentry
xml:id="loopback.ip">
<title>Loopback IP</title>
<para>Previous to hbase-0.96.0, HBase expects the loopback IP address to be 127.0.0.1. See <xref
linkend="loopback.ip" /></para>
</section>
<section
<term>Loopback IP</term>
<listitem>
<para>Prior to hbase-0.96.0, HBase only used the IP address
<systemitem>127.0.0.1</systemitem> to refer to <code>localhost</code>, and this could
not be configured. See <xref
linkend="loopback.ip" />.</para>
</listitem>
</varlistentry>
<varlistentry
xml:id="ntp">
<title>NTP</title>
<term>NTP</term>
<listitem>
<para>The clocks on cluster nodes should be synchronized. A small amount of variation is
acceptable, but larger amounts of skew can cause erratic and unexpected behavior. Time
synchronization is one of the first things to check if you see unexplained problems in
your cluster. It is recommended that you run a Network Time Protocol (NTP) service, or
another time-synchronization mechanism, on your cluster, and that all nodes look to the
same service for time synchronization. See the <link
xlink:href="http://www.tldp.org/LDP/sag/html/basic-ntp-config.html">Basic NTP
Configuration</link> at <citetitle>The Linux Documentation Project (TLDP)</citetitle>
to set up NTP.</para>
</listitem>
</varlistentry>
<para>The clocks on cluster members should be in basic alignment. Some skew is tolerable
but wild skew could generate odd behaviors. Run <link
xlink:href="http://en.wikipedia.org/wiki/Network_Time_Protocol">NTP</link> on your
cluster, or an equivalent.</para>
<para>If you are having problems querying data, or "weird" cluster operations, check system
time!</para>
</section>
<section
<varlistentry
xml:id="ulimit">
<title>
<varname>ulimit</varname><indexterm>
<term>Limits on Number of Files and Processes (<command>ulimit</command>)
<indexterm>
<primary>ulimit</primary>
</indexterm> and <varname>nproc</varname><indexterm>
</indexterm><indexterm>
<primary>nproc</primary>
</indexterm>
</title>
</term>
<para>Apache HBase is a database. It uses a lot of files all at the same time. The default
ulimit -n -- i.e. user file limit -- of 1024 on most *nix systems is insufficient (on Mac
OS X it is 256). Any significant amount of loading will lead you to <xref
linkend="trouble.rs.runtime.filehandles" />. You may also notice errors such as the
following:</para>
<screen>
<listitem>
<para>Apache HBase is a database. It requires the ability to open a large number of files
at once. Many Linux distributions limit the number of files a single user is allowed to
open to <literal>1024</literal> (or <literal>256</literal> on older versions of OS X).
You can check this limit on your servers by running the command <command>ulimit
-n</command> when logged in as the user which runs HBase. See <xref
linkend="trouble.rs.runtime.filehandles" /> for some of the problems you may
experience if the limit is too low. You may also notice errors such as the
following:</para>
<screen>
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901
</screen>
<para> Do yourself a favor and change the upper bound on the number of file descriptors. Set
it to north of 10k. The math runs roughly as follows: per ColumnFamily there is at least
one StoreFile and possibly up to 5 or 6 if the region is under load. Multiply the average
number of StoreFiles per ColumnFamily times the number of regions per RegionServer. For
example, assuming that a schema had 3 ColumnFamilies per region with an average of 3
StoreFiles per ColumnFamily, and there are 100 regions per RegionServer, the JVM will open
3 * 3 * 100 = 900 file descriptors (not counting open jar files, config files, etc.) </para>
<para>You should also up the hbase users' <varname>nproc</varname> setting; under load, a
low-nproc setting could manifest as <classname>OutOfMemoryError</classname>. <footnote>
<para>See Jack Levin's <link
xlink:href="">major hdfs issues</link> note up on the user list.</para>
</footnote>
<footnote>
<para>The requirement that a database requires upping of system limits is not peculiar
to Apache HBase. See for example the section <emphasis>Setting Shell Limits for the
Oracle User</emphasis> in <link
xlink:href="http://www.akadia.com/services/ora_linux_install_10g.html"> Short Guide
to install Oracle 10 on Linux</link>.</para>
</footnote></para>
</screen>
<para>It is recommended to raise the ulimit to at least 10,000, but more likely 10,240,
because the value is usually expressed in multiples of 1024. Each ColumnFamily has at
least one StoreFile, and possibly more than 6 StoreFiles if the region is under load.
The number of open files required depends upon the number of ColumnFamilies and the
number of regions. The following is a rough formula for calculating the potential number
of open files on a RegionServer. </para>
<example>
<title>Calculate the Potential Number of Open Files</title>
<screen>(ColumnFamilies per region) x (StoreFiles per ColumnFamily) x (regions per RegionServer)</screen>
</example>
<para>For example, assuming that a schema had 3 ColumnFamilies per region with an average
of 3 StoreFiles per ColumnFamily, and there are 100 regions per RegionServer, the JVM
will open 3 * 3 * 100 = 900 file descriptors, not counting open JAR files, configuration
files, and others. Opening a file does not take many resources, and the risk of allowing
a user to open too many files is minimal.</para>
<para>Another related setting is the number of processes a user is allowed to run at once.
In Linux and Unix, the number of processes is set using the <command>ulimit -u</command>
command. This should not be confused with the <command>nproc</command> command, which
reports the number of processing units available to the current process. Under load, a
process limit (<varname>nproc</varname>) that is too low can cause OutOfMemoryError exceptions. See
Jack Levin's <link
xlink:href="http://thread.gmane.org/gmane.comp.java.hadoop.hbase.user/16374">major
hdfs issues</link> thread on the hbase-users mailing list, from 2011.</para>
<para>Configuring the maximum number of file descriptors and processes for the user who is
running the HBase process is an operating system configuration, rather than an HBase
configuration. It is also important to be sure that the settings are changed for the
user that actually runs HBase. To see which user started HBase, and that user's ulimit
configuration, look at the first line of the HBase log for that instance.<footnote>
<para>A useful read on setting configuration on your Hadoop cluster is Aaron Kimball's <link
xlink:href="http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/">Configuration
Parameters: What can you just ignore?</link></para>
</footnote></para>
<formalpara xml:id="ulimit_ubuntu">
<title><command>ulimit</command> Settings on Ubuntu</title>
<para>To configure <command>ulimit</command> settings on Ubuntu, edit
<filename>/etc/security/limits.conf</filename>, which is a space-delimited file with
four columns. Refer to the <link
xlink:href="http://manpages.ubuntu.com/manpages/lucid/man5/limits.conf.5.html">man
page for limits.conf</link> for details about the format of this file. In the
following example, the first line sets both soft and hard limits for the number of
open files (<literal>nofile</literal>) to <literal>32768</literal> for the operating
system user with the username <literal>hadoop</literal>. The second line sets the
number of processes to 32000 for the same user.</para>
</formalpara>
<screen>
hadoop - nofile 32768
hadoop - nproc 32000
</screen>
<para>The settings are only applied if the Pluggable Authentication Module (PAM)
environment is directed to use them. To configure PAM to use these limits, be sure that
the <filename>/etc/pam.d/common-session</filename> file contains the following line:</para>
<screen>session required pam_limits.so</screen>
</listitem>
</varlistentry>
<para>To be clear, upping the file descriptors and nproc for the user who is running the
HBase process is an operating system configuration, not an HBase configuration. Also, a
common mistake is that administrators will up the file descriptors for a particular user
but for whatever reason, HBase will be running as someone else. HBase prints in its logs
as the first line the ulimit it is seeing. Ensure it is correct. <footnote>
<para>A useful read on setting configuration on your Hadoop cluster is Aaron Kimball's <link
xlink:href="http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/">Configuration
Parameters: What can you just ignore?</link></para>
</footnote></para>
<section
xml:id="ulimit_ubuntu">
<title><varname>ulimit</varname> on Ubuntu</title>
<para>If you are on Ubuntu you will need to make the following changes:</para>
<para>In the file <filename>/etc/security/limits.conf</filename> add a line like:</para>
<programlisting>hadoop - nofile 32768</programlisting>
<para>Replace <varname>hadoop</varname> with whatever user is running Hadoop and HBase. If
you have separate users, you will need 2 entries, one for each user. In the same file
set nproc hard and soft limits. For example:</para>
<programlisting>hadoop soft/hard nproc 32000</programlisting>
<para>In the file <filename>/etc/pam.d/common-session</filename> add as the last line in
the file: <programlisting>session required pam_limits.so</programlisting> Otherwise the
changes in <filename>/etc/security/limits.conf</filename> won't be applied.</para>
<para>Don't forget to log out and back in again for the changes to take effect!</para>
</section>
</section>
<section
<varlistentry
xml:id="windows">
<title>Windows</title>
<term>Windows</term>
<para>Previous to hbase-0.96.0, Apache HBase was little tested running on Windows. Running a
production install of HBase on top of Windows is not recommended.</para>
<listitem>
<para>Prior to HBase 0.96, testing for running HBase on Microsoft Windows was limited.
Running HBase on Windows nodes is not recommended for production systems.</para>
<para>If you are running HBase on Windows pre-hbase-0.96.0, you must install <link
xlink:href="http://cygwin.com/">Cygwin</link> to have a *nix-like environment for the
shell scripts. The full details are explained in the <link
<para>To run versions of HBase prior to 0.96 on Microsoft Windows, you must install <link
xlink:href="http://cygwin.com/">Cygwin</link> and run HBase within the Cygwin
environment. This provides support for Linux/Unix commands and scripts. The full details are explained in the <link
xlink:href="http://hbase.apache.org/cygwin.html">Windows Installation</link> guide. Also <link
xlink:href="http://search-hadoop.com/?q=hbase+windows&amp;fc_project=HBase&amp;fc_type=mail+_hash_+dev">search
our user mailing list</link> to pick up the latest fixes figured out by Windows users.</para>
<para>As of hbase-0.96.0, HBase runs natively on Windows with supporting
<command>*.cmd</command> scripts bundled. </para>
</section>
<command>*.cmd</command> scripts bundled. </para></listitem>
</varlistentry>
</section>
</variablelist>
<!-- OS -->
<section
@ -259,17 +350,18 @@
xlink:href="http://hadoop.apache.org">Hadoop</link><indexterm>
<primary>Hadoop</primary>
</indexterm></title>
<para>The below table shows some information about what versions of Hadoop are supported by
various HBase versions. Based on the version of HBase, you should select the most
appropriate version of Hadoop. We are not in the Hadoop distro selection business. You can
use Hadoop distributions from Apache, or learn about vendor distributions of Hadoop at <link
xlink:href="http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support" /></para>
<para>The following table summarizes the versions of Hadoop supported with each version of
HBase. Based on the version of HBase, you should select the most
appropriate version of Hadoop. You can use Apache Hadoop, or a vendor's distribution of
Hadoop. No distinction is made here. See <link
xlink:href="http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support" />
for information about vendors of Hadoop.</para>
<tip>
<title>Hadoop 2.x is better than Hadoop 1.x</title>
<para>Hadoop 2.x is faster, with more features such as short-circuit reads which will help
improve your HBase random read profile as well as important bug fixes that will improve your
overall HBase experience. You should run Hadoop 2 rather than Hadoop 1. HBase 0.98
deprecates use of Hadoop1. HBase 1.0 will not support Hadoop1. </para>
<title>Hadoop 2.x is recommended.</title>
<para>Hadoop 2.x is faster and includes features, such as short-circuit reads, which will
help improve your HBase random read profile. Hadoop 2.x also includes important bug fixes
that will improve your overall HBase experience. HBase 0.98 deprecates use of Hadoop 1.x,
and HBase 1.0 will not support Hadoop 1.x.</para>
</tip>
<para>Use the following legend to interpret this table:</para>
<simplelist
@ -618,7 +710,9 @@ Index: pom.xml
instance of the <emphasis>Hadoop Distributed File System</emphasis> (HDFS).
Fully-distributed mode can ONLY run on HDFS. See the Hadoop <link
xlink:href="http://hadoop.apache.org/common/docs/r1.1.1/api/overview-summary.html#overview_description">
requirements and instructions</link> for how to set up HDFS.</para>
requirements and instructions</link> for how to set up HDFS for Hadoop 1.x. A good
walk-through for setting up HDFS on Hadoop 2 is at <link
xlink:href="http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide">http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide</link>.</para>
<para>Below we describe the different distributed setups. Starting, verification and
exploration of your install, whether a <emphasis>pseudo-distributed</emphasis> or
@ -628,207 +722,139 @@ Index: pom.xml
<section
xml:id="pseudo">
<title>Pseudo-distributed</title>
<note>
<title>Pseudo-Distributed Quickstart</title>
<para>A quickstart has been added to the <xref
linkend="quickstart" /> chapter. See <xref
linkend="quickstart-pseudo" />. Some of the information that was originally in this
section has been moved there.</para>
</note>
<para>A pseudo-distributed mode is simply a fully-distributed mode run on a single host. Use
this configuration for testing and prototyping on HBase. Do not use this configuration for
production nor for evaluating HBase performance.</para>
<para>First, if you want to run on HDFS rather than on the local filesystem, set up your
HDFS. You can also set up HDFS in pseudo-distributed mode (TODO: Add pointer to HOWTO doc;
the Hadoop site doesn't have one any more). Ensure you have a working HDFS before
proceeding. </para>
<para>Next, configure HBase. Edit <filename>conf/hbase-site.xml</filename>. This is the file
into which you add local customizations and overrides. At a minimum, you must tell HBase
to run in (pseudo-)distributed mode rather than in default standalone mode. To do this,
set the <varname>hbase.cluster.distributed</varname> property to true (Its default is
<varname>false</varname>). The absolute bare-minimum <filename>hbase-site.xml</filename>
is therefore as follows:</para>
<programlisting><![CDATA[
<configuration>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
</configuration>
]]>
</programlisting>
<para>With this configuration, HBase will start up an HBase Master process, a ZooKeeper
server, and a RegionServer process running against the local filesystem writing to
wherever your operating system stores temporary files into a directory named
<filename>hbase-YOUR_USER_NAME</filename>.</para>
<para>Such a setup, using the local filesystem and writing to the operating system's
temporary directory, is an ephemeral setup; the Hadoop local filesystem -- which is what
HBase uses when it is writing to the local filesystem -- would lose data unless the system
was shut down properly in versions of HBase before 0.98.4 and 1.0.0 (see
<link xlink:href="https://issues.apache.org/jira/browse/HBASE-11218">HBASE-11218 Data
loss in HBase standalone mode</link>). Writing to the operating
system's temporary directory can also make for data loss when the machine is restarted as
this directory is usually cleared on reboot. For a more permanent setup, see the next
example where we make use of an instance of HDFS; HBase data will be written to the Hadoop
distributed filesystem rather than to the local filesystem's tmp directory.</para>
<para>In this <filename>conf/hbase-site.xml</filename> example, the
<varname>hbase.rootdir</varname> property points to the local HDFS instance homed on the
node <varname>h-24-30.example.com</varname>.</para>
<note>
<title>Let HBase create <filename>${hbase.rootdir}</filename></title>
<para>Let HBase create the <varname>hbase.rootdir</varname> directory. If you don't,
you'll get a warning saying HBase needs a migration run because the directory is missing
files expected by HBase (it'll create them if you let it).</para>
</note>
<programlisting>
&lt;configuration&gt;
&lt;property&gt;
&lt;name&gt;hbase.rootdir&lt;/name&gt;
&lt;value&gt;hdfs://h-24-30.example.com:8020/hbase&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;hbase.cluster.distributed&lt;/name&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;
&lt;/configuration&gt;
</programlisting>
<para>Now skip to <xref
linkend="confirm" /> for how to start and verify your pseudo-distributed install. <footnote>
<para>See <xref
linkend="pseudo.extras" /> for notes on how to start extra Masters and RegionServers
when running pseudo-distributed.</para>
</footnote></para>
<section
xml:id="pseudo.extras">
<title>Pseudo-distributed Extras</title>
<section
xml:id="pseudo.extras.start">
<title>Startup</title>
<para>To start up the initial HBase cluster...</para>
<screen>% bin/start-hbase.sh</screen>
<para>To start up an extra backup master(s) on the same server run...</para>
<screen>% bin/local-master-backup.sh start 1</screen>
<para>... the '1' means use ports 16001 &amp; 16011, and this backup master's logfile
will be at <filename>logs/hbase-${USER}-1-master-${HOSTNAME}.log</filename>. </para>
<para>To startup multiple backup masters run...</para>
<screen>% bin/local-master-backup.sh start 2 3</screen>
<para>You can start up to 9 backup masters (10 total). </para>
<para>To start up more regionservers...</para>
<screen>% bin/local-regionservers.sh start 1</screen>
<para>... where '1' means use ports 16201 &amp; 16301 and its logfile will be at
<filename>logs/hbase-${USER}-1-regionserver-${HOSTNAME}.log</filename>. </para>
<para>To add 4 more regionservers in addition to the one you just started,
run...</para>
<screen>% bin/local-regionservers.sh start 2 3 4 5</screen>
<para>This supports up to 99 extra regionservers (100 total). </para>
</section>
<section
xml:id="pseudo.options.stop">
<title>Stop</title>
<para>Assuming you want to stop master backup # 1, run...</para>
<screen>% cat /tmp/hbase-${USER}-1-master.pid |xargs kill -9</screen>
<para>Note that bin/local-master-backup.sh stop 1 will try to stop the cluster along
with the master. </para>
<para>To stop an individual regionserver, run...</para>
<screen>% bin/local-regionservers.sh stop 1</screen>
</section>
</section>
</section>
</section>
<section
xml:id="fully_dist">
<title>Fully-distributed</title>
<para>By default, HBase runs in standalone mode. Both standalone mode and pseudo-distributed
mode are provided for the purposes of small-scale testing. For a production environment,
distributed mode is appropriate. In distributed mode, multiple instances of HBase daemons
run on multiple servers in the cluster.</para>
<para>Just as in pseudo-distributed mode, a fully distributed configuration requires that you
set the <code>hbase.cluster.distributed</code> property to <literal>true</literal>.
Typically, the <code>hbase.rootdir</code> is configured to point to a highly-available HDFS
filesystem. </para>
<para>In addition, the cluster is configured so that multiple cluster nodes enlist as
RegionServers, ZooKeeper QuorumPeers, and backup HMaster servers. These configuration basics
are all demonstrated in <xref
linkend="quickstart-fully-distributed" />.</para>
<formalpara
xml:id="regionserver">
<title>Distributed RegionServers</title>
<para>Typically, your cluster will contain multiple RegionServers all running on different
servers, as well as primary and backup Master and ZooKeeper daemons. The
<filename>conf/regionservers</filename> file on the master server contains a list of
hosts whose RegionServers are associated with this cluster. Each host is on a separate
line. All hosts listed in this file will have their RegionServer processes started and
stopped when the master server starts or stops.</para>
</formalpara>
<formalpara
xml:id="hbase.zookeeper">
<title>ZooKeeper and HBase</title>
<para>See section <xref
linkend="zookeeper" /> for ZooKeeper setup for HBase.</para>
</formalpara>
<section
xml:id="fully_dist">
<title>Fully-distributed</title>
<para>For running a fully-distributed operation on more than one host, make the following
configurations. In <filename>hbase-site.xml</filename>, add the property
<varname>hbase.cluster.distributed</varname> and set it to <varname>true</varname> and
point the HBase <varname>hbase.rootdir</varname> at the appropriate HDFS NameNode and
location in HDFS where you would like HBase to write data. For example, if your NameNode
were running at namenode.example.org on port 8020 and you wanted to home your HBase in
HDFS at <filename>/hbase</filename>, make the following configuration.</para>
<example>
<title>Example Distributed HBase Cluster</title>
<para>This is a bare-bones <filename>conf/hbase-site.xml</filename> for a distributed HBase
cluster. A cluster that is used for real-world work would contain more custom
configuration parameters. Most HBase configuration directives have default values, which
are used unless the value is overridden in the <filename>hbase-site.xml</filename>. See <xref
linkend="config.files" /> for more information.</para>
<programlisting><![CDATA[
<configuration>
...
<property>
<name>hbase.rootdir</name>
<value>hdfs://namenode.example.org:8020/hbase</value>
<description>The directory shared by RegionServers.
</description>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
<description>The mode the cluster will be in. Possible values are
false: standalone and pseudo-distributed setups with managed Zookeeper
true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
</description>
</property>
...
<property>
<name>hbase.zookeeper.quorum</name>
<value>node-a.example.com,node-b.example.com,node-c.example.com</value>
</property>
</configuration>
]]>
</programlisting>
<para>This is an example <filename>conf/regionservers</filename> file, which contains a list
of each node that should run a RegionServer in the cluster. These nodes need HBase
installed and they need to use the same contents of the <filename>conf/</filename>
directory as the Master server.</para>
<programlisting>
node-a.example.com
node-b.example.com
node-c.example.com
</programlisting>
<para>This is an example <filename>conf/backup-masters</filename> file, which contains a
list of each node that should run a backup Master instance. The backup Master instances
will sit idle unless the main Master becomes unavailable.</para>
<programlisting>
node-b.example.com
node-c.example.com
</programlisting>
</example>
<formalpara>
<title>Distributed HBase Quickstart</title>
<para>See <xref
linkend="quickstart-fully-distributed" /> for a walk-through of a simple three-node
cluster configuration with multiple ZooKeeper, backup HMaster, and RegionServer
instances.</para>
</formalpara>
<section
xml:id="regionserver">
<title><filename>regionservers</filename></title>
<para>In addition, a fully-distributed mode requires that you modify
<filename>conf/regionservers</filename>. The <xref
linkend="regionservers" /> file lists all hosts that you would have running
<application>HRegionServer</application>s, one host per line (This file in HBase is
like the Hadoop <filename>slaves</filename> file). All servers listed in this file will
be started and stopped when HBase cluster start or stop is run.</para>
</section>
<section
xml:id="hbase.zookeeper">
<title>ZooKeeper and HBase</title>
<para>See section <xref
linkend="zookeeper" /> for ZooKeeper setup for HBase.</para>
</section>
<section
xml:id="hdfs_client_conf">
<title>HDFS Client Configuration</title>
<para>Of note, if you have made <emphasis>HDFS client configuration</emphasis> on your
Hadoop cluster -- i.e. configuration you want HDFS clients to use as opposed to
server-side configurations -- HBase will not see this configuration unless you do one of
the following:</para>
<itemizedlist>
<listitem>
<procedure
xml:id="hdfs_client_conf">
<title>HDFS Client Configuration</title>
<step>
<para>Of note, if you have made HDFS client configuration on your Hadoop cluster, such as
configuration directives for HDFS clients, as opposed to server-side configurations, you
must use one of the following methods to enable HBase to see and use these configuration
changes:</para>
<stepalternatives>
<step>
<para>Add a pointer to your <varname>HADOOP_CONF_DIR</varname> to the
<varname>HBASE_CLASSPATH</varname> environment variable in
<filename>hbase-env.sh</filename>.</para>
</listitem>
</step>
<listitem>
<step>
<para>Add a copy of <filename>hdfs-site.xml</filename> (or
<filename>hadoop-site.xml</filename>) or, better, symlinks, under
<filename>${HBASE_HOME}/conf</filename>, or</para>
</listitem>
</step>
<listitem>
<step>
<para>if only a small set of HDFS client configurations, add them to
<filename>hbase-site.xml</filename>.</para>
</listitem>
</itemizedlist>
<para>An example of such an HDFS client configuration is
<varname>dfs.replication</varname>. If for example, you want to run with a replication
factor of 5, hbase will create files with the default of 3 unless you do the above to
make the configuration available to HBase.</para>
</section>
</section>
</step>
</stepalternatives>
</step>
</procedure>
<para>An example of such an HDFS client configuration is <varname>dfs.replication</varname>.
If, for example, you want to run with a replication factor of 5, HBase will create files with
the default of 3 unless you do the above to make the configuration available to
HBase.</para>
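<para>As a minimal sketch of the last option, assuming the replication factor of 5 used in
the example above, a fragment like the following could be added inside the
<markup>&lt;configuration&gt;</markup> element of <filename>hbase-site.xml</filename>:</para>
<programlisting><![CDATA[
<!-- Illustrative only: makes the HDFS client setting visible to HBase. -->
<property>
  <name>dfs.replication</name>
  <value>5</value>
</property>
]]></programlisting>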
</section>
</section>
<section
xml:id="confirm">
@ -871,7 +897,7 @@ stopping hbase...............</screen>
of many machines. If you are running a distributed operation, be sure to wait until HBase
has shut down completely before stopping the Hadoop daemons.</para>
</section>
</section>
<!-- run modes -->

View File

@ -40,46 +40,51 @@
<section
xml:id="quickstart">
<title>Quick Start</title>
<title>Quick Start - Standalone HBase</title>
<para>This guide describes setup of a standalone HBase instance. It will run against the local
filesystem. In later sections we will take you through how to run HBase on Apache Hadoop's
HDFS, a distributed filesystem. This section shows you how to create a table in HBase,
inserting rows into your new HBase table via the HBase <command>shell</command>, and then
cleaning up and shutting down your standalone, local filesystem-based HBase instance. The
below exercise should take no more than ten minutes (not including download time). </para>
<note
<para>This guide describes setup of a standalone HBase instance running against the local
filesystem. This is not an appropriate configuration for a production instance of HBase, but
will allow you to experiment with HBase. This section shows you how to create a table in
HBase using the <command>hbase shell</command> CLI, insert rows into the table, perform put
and scan operations against the table, enable or disable the table, and start and stop HBase.
Apart from downloading HBase, this procedure should take less than 10 minutes.</para>
<warning
xml:id="local.fs.durability">
<title>Local Filesystem and Durability</title>
<para>Using HBase with a LocalFileSystem does not currently guarantee durability. The HDFS
local filesystem implementation will lose edits if files are not properly closed -- which is
very likely to happen when experimenting with a new download. You need to run HBase on HDFS
to ensure all writes are preserved. Running against the local filesystem though will get you
off the ground quickly and get you familiar with how the general system works so lets run
with it for now. See <link
<para><emphasis>The below advice is for HBase 0.98.2 and earlier releases only. This is fixed
in HBase 0.98.3 and beyond. See <link
xlink:href="https://issues.apache.org/jira/browse/HBASE-11272">HBASE-11272</link> and
<link
xlink:href="https://issues.apache.org/jira/browse/HBASE-11218">HBASE-11218</link>.</emphasis></para>
<para>Using HBase with a local filesystem does not guarantee durability. The HDFS
local filesystem implementation will lose edits if files are not properly closed. This is
very likely to happen when you are experimenting with new software, starting and stopping
the daemons often and not always cleanly. You need to run HBase on HDFS
to ensure all writes are preserved. Running against the local filesystem is intended as a
shortcut to get you familiar with how the general system works, as the very first phase of
evaluation. See <link
xlink:href="https://issues.apache.org/jira/browse/HBASE-3696" /> and its associated issues
for more details.</para>
</note>
for more details about the issues of running on the local filesystem.</para>
</warning>
<note
xml:id="loopback.ip.getting.started">
<title>Loopback IP</title>
<para><emphasis>The below advice is for hbase-0.94.x and older versions only. We believe this
fixed in hbase-0.96.0 and beyond (let us know if we have it wrong).</emphasis> There
should be no need of the below modification to <filename>/etc/hosts</filename> in later
versions of HBase.</para>
<title>Loopback IP - HBase 0.94.x and earlier</title>
<para><emphasis>The below advice is for hbase-0.94.x and older versions only. This is fixed in
hbase-0.96.0 and beyond.</emphasis></para>
<para>HBase expects the loopback IP address to be 127.0.0.1. Ubuntu and some other
distributions, for example, will default to 127.0.1.1 and this will cause problems for you <footnote>
<para>See <link
xlink:href="http://blog.devving.com/why-does-hbase-care-about-etchosts/">Why does
HBase care about /etc/hosts?</link> for detail.</para>
</footnote>. </para>
<para><filename>/etc/hosts</filename> should look something like this:</para>
<screen>
<para>HBase 0.94.x and earlier expected the loopback IP address to be 127.0.0.1. Ubuntu
and some other distributions default to 127.0.1.1, which causes problems. See <link
xlink:href="http://blog.devving.com/why-does-hbase-care-about-etchosts/">Why does HBase
care about /etc/hosts?</link> for detail.</para>
<example>
<title>Example /etc/hosts File for Ubuntu</title>
<para>The following <filename>/etc/hosts</filename> file works correctly for HBase 0.94.x
and earlier, on Ubuntu. Use this as a template if you run into trouble.</para>
<screen>
127.0.0.1 localhost
127.0.0.1 ubuntu.ubuntu-domain ubuntu
</screen>
</screen>
</example>
</note>
<section>
@ -89,159 +94,611 @@
</section>
<section>
<title>Download and unpack the latest stable release.</title>
<title>Get Started with HBase</title>
<para>Choose a download site from this list of <link
<procedure>
<title>Download, Configure, and Start HBase</title>
<step>
<para>Choose a download site from this list of <link
xlink:href="http://www.apache.org/dyn/closer.cgi/hbase/">Apache Download Mirrors</link>.
Click on the suggested top link. This will take you to a mirror of <emphasis>HBase
Releases</emphasis>. Click on the folder named <filename>stable</filename> and then
download the file that ends in <filename>.tar.gz</filename> to your local filesystem; e.g.
<filename>hbase-0.94.2.tar.gz</filename>.</para>
<para>Decompress and untar your download and then change into the unpacked directory.</para>
<screen><![CDATA[$ tar xfz hbase-<?eval ${project.version}?>.tar.gz
$ cd hbase-<?eval ${project.version}?>]]>
</screen>
<para>At this point, you are ready to start HBase. But before starting it, edit
<filename>conf/hbase-site.xml</filename>, the file you write your site-specific
configurations into. Set <varname>hbase.rootdir</varname>, the directory HBase writes data
to, and <varname>hbase.zookeeper.property.dataDir</varname>, the directory ZooKeeper writes
its data to:</para>
<programlisting><![CDATA[<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
download the binary file that ends in <filename>.tar.gz</filename> to your local filesystem. Be
sure to choose the version that corresponds with the version of Hadoop you are likely to use
later. In most cases, you should choose the file for Hadoop 2, which will be called something
like <filename>hbase-0.98.3-hadoop2-bin.tar.gz</filename>. Do not download the file ending in
<filename>src.tar.gz</filename> for now.</para>
</step>
<step>
<para>Extract the downloaded file, and change to the newly-created directory.</para>
<screen>
$ tar xzvf hbase-<![CDATA[<?eval ${project.version}?>]]>-hadoop2-bin.tar.gz
$ cd hbase-<![CDATA[<?eval ${project.version}?>]]>-hadoop2/
</screen>
</step>
<step>
<para>Edit <filename>conf/hbase-site.xml</filename>, which is the main HBase configuration
file. At this time, you only need to specify the directory on the local filesystem where
HBase and ZooKeeper write data. By default, a new directory is created under /tmp. Many
servers are configured to delete the contents of /tmp upon reboot, so you should store
the data elsewhere. The following configuration will store HBase's data in the
<filename>hbase</filename> directory, in the home directory of the user called
<systemitem>testuser</systemitem>. Paste the <markup>&lt;property&gt;</markup> tags beneath the
<markup>&lt;configuration&gt;</markup> tags, which should be empty in a new HBase install.</para>
<example>
<title>Example <filename>hbase-site.xml</filename> for Standalone HBase</title>
<programlisting><![CDATA[
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///DIRECTORY/hbase</value>
<value>file:///home/testuser/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/DIRECTORY/zookeeper</value>
<value>/home/testuser/zookeeper</value>
</property>
</configuration>]]></programlisting>
<para> Replace <varname>DIRECTORY</varname> in the above with the path to the directory you
would have HBase and ZooKeeper write their data. By default,
<varname>hbase.rootdir</varname> is set to <filename>/tmp/hbase-${user.name}</filename>
and similarly so for the default ZooKeeper data location which means you'll lose all your
data whenever your server reboots unless you change it (Most operating systems clear
<filename>/tmp</filename> on restart).</para>
</section>
</configuration>
]]>
</programlisting>
</example>
<para>You do not need to create the HBase data directory. HBase will do this for you. If
you create the directory, HBase will attempt to do a migration, which is not what you
want.</para>
</step>
<step xml:id="start_hbase">
<para>The <filename>bin/start-hbase.sh</filename> script is provided as a convenient way
to start HBase. Issue the command, and if all goes well, a message is logged to standard
output showing that HBase started successfully. You can use the <command>jps</command>
command to verify that you have one running process called <literal>HMaster</literal>
and at least one called <literal>HRegionServer</literal>.</para>
<note><para>Java needs to be installed and available. If you get an error indicating that
Java is not installed, but it is on your system, perhaps in a non-standard location,
edit the <filename>conf/hbase-env.sh</filename> file and modify the
<envar>JAVA_HOME</envar> setting to point to the directory that contains
<filename>bin/java</filename> on your system.</para></note>
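<para>For illustration, the line in <filename>conf/hbase-env.sh</filename> might look like
the following. The path is a hypothetical example, not a default; substitute the location
of the Java installation on your own system.</para>
<screen>
# Hypothetical path; the directory must contain bin/java
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
</screen>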
</step>
</procedure>
<section
xml:id="start_hbase">
<title>Start HBase</title>
<para>Now start HBase:</para>
<screen>$ ./bin/start-hbase.sh
starting Master, logging to logs/hbase-user-master-example.org.out</screen>
<para>You should now have a running standalone HBase instance. In standalone mode, HBase runs
all daemons in the one JVM; i.e. both the HBase and ZooKeeper daemons. HBase logs can be
found in the <filename>logs</filename> subdirectory. Check them out especially if it seems
HBase had trouble starting.</para>
<note>
<title>Is <application>java</application> installed?</title>
<para>All of the above presumes a 1.6 version of Oracle <application>java</application> is
installed on your machine and available on your path (See <xref
linkend="java" />); i.e. when you type <application>java</application>, you see output
that describes the options the java program takes (HBase requires java 6). If this is not
the case, HBase will not start. Install java, edit <filename>conf/hbase-env.sh</filename>,
uncommenting the <envar>JAVA_HOME</envar> line pointing it to your java install, then,
retry the steps above.</para>
</note>
</section>
<section
xml:id="shell_exercises">
<title>Shell Exercises</title>
<para>Connect to your running HBase via the <command>shell</command>.</para>
<screen><![CDATA[$ ./bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version: 0.90.0, r1001068, Fri Sep 24 13:55:42 PDT 2010
hbase(main):001:0>]]> </screen>
<para>Type <command>help</command> and then <command>&lt;RETURN&gt;</command> to see a listing
of shell commands and options. Browse at least the paragraphs at the end of the help
emission for the gist of how variables and command arguments are entered into the HBase
shell; in particular note how table names, rows, and columns, etc., must be quoted.</para>
<para>Create a table named <varname>test</varname> with a single column family named
<varname>cf</varname>. Verify its creation by listing all tables and then insert some
values.</para>
<screen><![CDATA[hbase(main):003:0> create 'test', 'cf'
<procedure xml:id="shell_exercises">
<title>Use HBase For the First Time</title>
<step>
<title>Connect to HBase.</title>
<para>Connect to your running instance of HBase using the <command>hbase shell</command>
command, located in the <filename>bin/</filename> directory of your HBase
install. In this example, some usage and version information that is printed when you
start HBase Shell has been omitted. The HBase Shell prompt ends with a
<literal>&gt;</literal> character.</para>
<screen>
$ <userinput>./bin/hbase shell</userinput>
hbase(main):001:0&gt;
</screen>
</step>
<step>
<title>Display HBase Shell Help Text.</title>
<para>Type <literal>help</literal> and press Enter, to display some basic usage
information for HBase Shell, as well as several example commands. Notice that table
names, rows, and columns must all be enclosed in quote characters.</para>
</step>
<step>
<title>Create a table.</title>
<para>Use the <code>create</code> command to create a new table. You must specify the
table name and the ColumnFamily name.</para>
<screen>
hbase&gt; <userinput>create 'test', 'cf'</userinput>
0 row(s) in 1.2200 seconds
hbase(main):003:0> list 'test'
..
1 row(s) in 0.0550 seconds
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0560 seconds
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0370 seconds
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0450 seconds]]></screen>
</screen>
</step>
<step>
<title>List Information About your Table</title>
<para>Use the <code>list</code> command to confirm that your table exists.</para>
<screen>
hbase&gt; <userinput>list 'test'</userinput>
TABLE
test
1 row(s) in 0.0350 seconds
<para>Above we inserted 3 values, one at a time. The first insert is at
<varname>row1</varname>, column <varname>cf:a</varname> with a value of
<varname>value1</varname>. Columns in HBase are comprised of a column family prefix --
<varname>cf</varname> in this example -- followed by a colon and then a column qualifier
suffix (<varname>a</varname> in this case).</para>
=> ["test"]
</screen>
</step>
<step>
<title>Put data into your table.</title>
<para>To put data into your table, use the <code>put</code> command.</para>
<screen>
hbase&gt; <userinput>put 'test', 'row1', 'cf:a', 'value1'</userinput>
0 row(s) in 0.1770 seconds
<para>Verify the data insert by running a scan of the table as follows</para>
hbase&gt; <userinput>put 'test', 'row2', 'cf:b', 'value2'</userinput>
0 row(s) in 0.0160 seconds
<screen><![CDATA[hbase(main):007:0> scan 'test'
ROW COLUMN+CELL
row1 column=cf:a, timestamp=1288380727188, value=value1
row2 column=cf:b, timestamp=1288380738440, value=value2
row3 column=cf:c, timestamp=1288380747365, value=value3
3 row(s) in 0.0590 seconds]]></screen>
hbase&gt; <userinput>put 'test', 'row3', 'cf:c', 'value3'</userinput>
0 row(s) in 0.0260 seconds
</screen>
<para>Here, we insert three values, one at a time. The first insert is at
<literal>row1</literal>, column <literal>cf:a</literal>, with a value of
<literal>value1</literal>. Columns in HBase are comprised of a column family prefix,
<literal>cf</literal> in this example, followed by a colon and then a column qualifier
suffix, <literal>a</literal> in this case.</para>
</step>
<step>
<title>Scan the table for all data at once.</title>
<para>One of the ways to get data from HBase is to scan. Use the <command>scan</command>
command to scan the table for data. You can limit your scan, but for now, all data is
fetched.</para>
<screen>
hbase&gt; <userinput>scan 'test'</userinput>
ROW COLUMN+CELL
row1 column=cf:a, timestamp=1403759475114, value=value1
row2 column=cf:b, timestamp=1403759492807, value=value2
row3 column=cf:c, timestamp=1403759503155, value=value3
3 row(s) in 0.0440 seconds
</screen>
</step>
<step>
<title>Get a single row of data.</title>
<para>To get a single row of data at a time, use the <command>get</command> command.</para>
<screen>
hbase&gt; <userinput>get 'test', 'row1'</userinput>
COLUMN CELL
cf:a timestamp=1403759475114, value=value1
1 row(s) in 0.0230 seconds
</screen>
</step>
<step>
<title>Disable a table.</title>
<para>If you want to delete a table or change its settings, as well as in some other
situations, you need to disable the table first, using the <code>disable</code>
command. You can re-enable it using the <code>enable</code> command.</para>
<screen>
hbase&gt; disable 'test'
0 row(s) in 1.6270 seconds
<para>Get a single row</para>
<screen><![CDATA[hbase(main):008:0> get 'test', 'row1'
COLUMN CELL
cf:a timestamp=1288380727188, value=value1
1 row(s) in 0.0400 seconds]]></screen>
<para>Now, disable and drop your table. This will clean up all done above.</para>
<screen><![CDATA[hbase(main):012:0> disable 'test'
0 row(s) in 1.0930 seconds
hbase(main):013:0> drop 'test'
0 row(s) in 0.0770 seconds ]]></screen>
<para>Exit the shell by typing exit.</para>
<programlisting><![CDATA[hbase(main):014:0> exit]]></programlisting>
hbase&gt; enable 'test'
0 row(s) in 0.4500 seconds
</screen>
<para>Disable the table again if you tested the <command>enable</command> command above:</para>
<screen>
hbase&gt; disable 'test'
0 row(s) in 1.6270 seconds
</screen>
</step>
<step>
<title>Drop the table.</title>
<para>To drop (delete) a table, use the <code>drop</code> command.</para>
<screen>
hbase&gt; drop 'test'
0 row(s) in 0.2900 seconds
</screen>
</step>
<step>
<title>Exit the HBase Shell.</title>
<para>To exit the HBase Shell and disconnect from your cluster, use the
<command>quit</command> command. HBase is still running in the background.</para>
</step>
</procedure>
<procedure
xml:id="stopping">
<title>Stop HBase</title>
<step>
<para>In the same way that the <filename>bin/start-hbase.sh</filename> script is provided
to conveniently start all HBase daemons, the <filename>bin/stop-hbase.sh</filename>
script stops them.</para>
<screen>
$ ./bin/stop-hbase.sh
stopping hbase....................
$
</screen>
</step>
<step>
<para>After issuing the command, it can take several minutes for the processes to shut
down. Use the <command>jps</command> command to be sure that the HMaster and HRegionServer
processes are shut down.</para>
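<para>One way to perform this check, sketched here under the assumption that
<command>jps</command> is on your path, is to filter its output for the daemon names. If
the command prints nothing, neither process is running.</para>
<screen>
$ jps | grep -E 'HMaster|HRegionServer'
</screen>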
</step>
</procedure>
</section>
<section
xml:id="stopping">
<title>Stopping HBase</title>
<para>Stop your hbase instance by running the stop script.</para>
<screen>$ ./bin/stop-hbase.sh
stopping hbase...............</screen>
<section xml:id="quickstart-pseudo">
<title>Intermediate - Pseudo-Distributed Local Install</title>
<para>After working your way through <xref linkend="quickstart" />, you can re-configure HBase
to run in pseudo-distributed mode. Pseudo-distributed mode means
that HBase still runs completely on a single host, but each HBase daemon (HMaster,
HRegionServer, and ZooKeeper) runs as a separate process. By default, unless you configure the
<code>hbase.rootdir</code> property as described in <xref linkend="quickstart" />, your data
is still stored in <filename>/tmp/</filename>. In this walk-through, we store your data in
HDFS instead, assuming you have HDFS available. You can skip the HDFS configuration to
continue storing your data in the local filesystem.</para>
<note>
<title>Hadoop Configuration</title>
<para>This procedure assumes that you have configured Hadoop and HDFS on your local system
or a remote system, and that they are running and available. It also assumes you are
using Hadoop 2. Currently, the documentation on the Hadoop website does not include a
quick start for Hadoop 2, but the guide at <link
xlink:href="http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide">http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide</link>
is a good starting point.</para>
</note>
<procedure>
<step>
<title>Stop HBase if it is running.</title>
<para>If you have just finished <xref linkend="quickstart" /> and HBase is still running,
stop it. This procedure will create a totally new directory where HBase will store its
data, so any databases you created before will be lost.</para>
</step>
<step>
<title>Configure HBase.</title>
<para>
Edit the <filename>hbase-site.xml</filename> configuration. First, add the following
property, which directs HBase to run in distributed mode, with one JVM instance per
daemon.
</para>
<programlisting><![CDATA[
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
]]></programlisting>
<para>Next, change the <code>hbase.rootdir</code> from the local filesystem to the address
of your HDFS instance, using the <code>hdfs://</code> URI syntax. In this example,
HDFS is running on the localhost at port 8020.</para>
<programlisting><![CDATA[
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:8020/hbase</value>
</property>
]]>
</programlisting>
<para>You do not need to create the directory in HDFS. HBase will do this for you. If you
create the directory, HBase will attempt to do a migration, which is not what you
want.</para>
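<para>Taken together, a minimal <filename>conf/hbase-site.xml</filename> for this
walk-through might look like the following sketch. It assumes, as in the example above,
that HDFS is reachable at <literal>localhost:8020</literal>; adjust the host and port for
your own NameNode. Any other properties you have set are omitted here.</para>
<programlisting><![CDATA[
<configuration>
  <!-- Run each HBase daemon in its own JVM. -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <!-- Store HBase data in HDFS; the host and port are assumptions. -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:8020/hbase</value>
  </property>
</configuration>
]]></programlisting>
<para>If <command>xmllint</command> is installed, running <command>xmllint --noout
conf/hbase-site.xml</command> is one way to confirm that the file is still well-formed
after editing.</para>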
</step>
<step>
<title>Start HBase.</title>
<para>Use the <filename>bin/start-hbase.sh</filename> command to start HBase. If your
system is configured correctly, the <command>jps</command> command should show the
HMaster and HRegionServer processes running.</para>
</step>
<step>
<title>Check the HBase directory in HDFS.</title>
<para>If everything worked correctly, HBase created its directory in HDFS. In the
configuration above, it is stored in <filename>/hbase/</filename> on HDFS. You can use
the <command>hadoop fs</command> command in Hadoop's <filename>bin/</filename> directory
to list this directory.</para>
<screen>
$ <userinput>./bin/hadoop fs -ls /hbase</userinput>
Found 7 items
drwxr-xr-x - hbase users 0 2014-06-25 18:58 /hbase/.tmp
drwxr-xr-x - hbase users 0 2014-06-25 21:49 /hbase/WALs
drwxr-xr-x - hbase users 0 2014-06-25 18:48 /hbase/corrupt
drwxr-xr-x - hbase users 0 2014-06-25 18:58 /hbase/data
-rw-r--r-- 3 hbase users 42 2014-06-25 18:41 /hbase/hbase.id
-rw-r--r-- 3 hbase users 7 2014-06-25 18:41 /hbase/hbase.version
drwxr-xr-x - hbase users 0 2014-06-25 21:49 /hbase/oldWALs
</screen>
</step>
<step>
<title>Create a table and populate it with data.</title>
<para>You can use the HBase Shell to create a table, populate it with data, scan and get
values from it, using the same procedure as in <xref linkend="shell_exercises" />.</para>
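<para>If you prefer not to type the commands interactively, one option is to feed them to
the shell on standard input. This is a sketch that assumes your version of
<command>hbase shell</command> reads commands from standard input; the table, row, and
column names are the ones used in the earlier exercise.</para>
<screen><![CDATA[
$ ./bin/hbase shell <<EOF
create 'test', 'cf'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'
EOF
]]></screen>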
</step>
<step>
<title>Start and stop a backup HBase Master (HMaster) server.</title>
<note>
<para>Running multiple HMaster instances on the same hardware does not make sense in a
production environment, in the same way that running a pseudo-distributed cluster does
not make sense for production. This step is offered for testing and learning purposes
only.</para>
</note>
<para>The HMaster server controls the HBase cluster. You can start up to 9 backup HMaster
servers, which makes 10 total HMasters, counting the primary. To start a backup HMaster,
use the <command>local-master-backup.sh</command> script. For each backup master you want to
start, add a parameter representing the port offset for that master. Each HMaster uses
two ports (16000 and 16010 by default). The port offset is added to these ports, so
using an offset of 2, the first backup HMaster would use ports 16002 and 16012. The
following command starts 3 backup servers using ports 16002/16012, 16003/16013, and
16005/16015.</para>
<screen>
$ ./bin/local-master-backup.sh start 2 3 5
</screen>
<para>To kill a backup master without killing the entire cluster, you need to find its
process ID (PID). The PID is stored in a file with a name like
<filename>/tmp/hbase-<replaceable>USER</replaceable>-<replaceable>X</replaceable>-master.pid</filename>.
The only contents of the file are the PID. You can use the <command>kill -9</command>
command to kill that PID. The following command will kill the master with port offset 1,
but leave the cluster running:</para>
<screen>
$ cat /tmp/hbase-testuser-1-master.pid |xargs kill -9
</screen>
</step>
<step>
<title>Start and stop additional RegionServers</title>
<para>The HRegionServer manages the data in its StoreFiles as directed by the HMaster.
Generally, one HRegionServer runs per node in the cluster. Running multiple
HRegionServers on the same system can be useful for testing in pseudo-distributed mode.
The <command>local-regionservers.sh</command> command allows you to run multiple
RegionServers. It works in a similar way to the
<command>local-master-backup.sh</command> command, in that each parameter you provide
represents the port offset for an instance. Each RegionServer requires two ports, and
the default ports are 16200 and 16300. You can run 99 additional RegionServers, or 100
total, on a server. The following command starts four additional
RegionServers, running on sequential ports starting at 16202/16302.</para>
<screen>
$ ./bin/local-regionservers.sh start 2 3 4 5
</screen>
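<para>The port arithmetic can be checked with a short shell loop. This is only an
illustration and assumes the default base ports of 16200 and 16300 given above:</para>
<screen>
$ for offset in 2 3 4 5; do echo "offset $offset: $((16200 + offset)) and $((16300 + offset))"; done
offset 2: 16202 and 16302
offset 3: 16203 and 16303
offset 4: 16204 and 16304
offset 5: 16205 and 16305
</screen>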
<para>To stop a RegionServer manually, use the <command>local-regionservers.sh</command>
command with the <literal>stop</literal> parameter and the offset of the server to
stop.</para>
<screen>$ ./bin/local-regionservers.sh stop 3</screen>
</step>
<step>
<title>Stop HBase.</title>
<para>You can stop HBase the same way as in the <xref
linkend="quickstart" /> procedure, using the
<filename>bin/stop-hbase.sh</filename> command.</para>
</step>
</procedure>
</section>
<section xml:id="quickstart-fully-distributed">
<title>Advanced - Fully Distributed</title>
<para>In reality, you need a fully-distributed configuration to fully test HBase and to use it
in real-world scenarios. In a distributed configuration, the cluster contains multiple
nodes, each of which runs one or more HBase daemons. These include primary and backup Master
instances, multiple ZooKeeper nodes, and multiple RegionServer nodes.</para>
<para>This advanced quickstart adds two more nodes to your cluster. The architecture will be
as follows:</para>
<table>
<title>Distributed Cluster Demo Architecture</title>
<tgroup cols="4">
<thead>
<row>
<entry>Node Name</entry>
<entry>Master</entry>
<entry>ZooKeeper</entry>
<entry>RegionServer</entry>
</row>
</thead>
<tbody>
<row>
<entry>node-a.example.com</entry>
<entry>yes</entry>
<entry>yes</entry>
<entry>no</entry>
</row>
<row>
<entry>node-b.example.com</entry>
<entry>backup</entry>
<entry>yes</entry>
<entry>yes</entry>
</row>
<row>
<entry>node-c.example.com</entry>
<entry>no</entry>
<entry>yes</entry>
<entry>yes</entry>
</row>
</tbody>
</tgroup>
</table>
<para>This quickstart assumes that each node is a virtual machine and that they are all on the
same network. It builds upon the previous quickstart, <xref linkend="quickstart-pseudo" />,
assuming that the system you configured in that procedure is now <code>node-a</code>. Stop HBase on <code>node-a</code>
before continuing.</para>
<note>
<para>Be sure that all the nodes have full access to communicate, and that no firewall rules
are in place which could prevent them from talking to each other. If you see any errors like
<literal>no route to host</literal>, check your firewall.</para>
</note>
<procedure xml:id="passwordless.ssh.quickstart">
<title>Configure Password-Less SSH Access</title>
<para><code>node-a</code> needs to be able to log into <code>node-b</code> and
<code>node-c</code> (and to itself) in order to start the daemons. The easiest way to accomplish this is
to use the same username on all hosts, and configure password-less SSH login from
<code>node-a</code> to each of the others. </para>
<step>
<title>On <code>node-a</code>, generate a key pair.</title>
<para>While logged in as the user who will run HBase, generate an SSH key pair, using the
following command:
</para>
<screen>$ ssh-keygen -t rsa</screen>
<para>If the command succeeds, the location of the key pair is printed to standard output.
The default name of the public key is <filename>id_rsa.pub</filename>.</para>
</step>
<step>
<title>Create the directory that will hold the shared keys on the other nodes.</title>
<para>On <code>node-b</code> and <code>node-c</code>, log in as the HBase user and create
a <filename>.ssh/</filename> directory in the user's home directory, if it does not
already exist. If it already exists, be aware that it may already contain other keys.</para>
</step>
<step>
<title>Copy the public key to the other nodes.</title>
<para>Securely copy the public key from <code>node-a</code> to each of the nodes, by
using <command>scp</command> or some other secure means. On each of the other nodes,
create a new file called <filename>.ssh/authorized_keys</filename> <emphasis>if it does
not already exist</emphasis>, and append the contents of the
<filename>id_rsa.pub</filename> file to the end of it. Note that you also need to do
this for <code>node-a</code> itself.</para>
<screen>$ cat id_rsa.pub &gt;&gt; ~/.ssh/authorized_keys</screen>
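<para>One possible way to carry out this step for <code>node-b</code> is sketched below. It
assumes that the key pair is in the default location, that the same username exists on both
hosts, and that password login still works at this point. The temporary file name
<filename>/tmp/node-a_id_rsa.pub</filename> is a placeholder chosen for illustration. The
<command>chmod</command> calls are included because many SSH server configurations ignore an
<filename>authorized_keys</filename> file whose permissions are too open.</para>
<screen><![CDATA[
$ scp ~/.ssh/id_rsa.pub node-b.example.com:/tmp/node-a_id_rsa.pub
$ ssh node-b.example.com 'mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat /tmp/node-a_id_rsa.pub >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys && rm /tmp/node-a_id_rsa.pub'
]]></screen>
<para>Repeat the same two commands for <code>node-c</code>. If the
<command>ssh-copy-id</command> utility is available on your system, it may be able to do
the same job in one command.</para>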
</step>
<step>
<title>Test password-less login.</title>
<para>If you performed the procedure correctly, you should not be prompted for a password
when you SSH from <code>node-a</code> to either of the other nodes using the same username.
</para>
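<para>A quick non-interactive check is sketched below, assuming an OpenSSH client. The
<literal>BatchMode=yes</literal> option disables password prompting, so the command fails
instead of waiting for input when key-based login is not working.</para>
<screen><![CDATA[
$ for host in node-a.example.com node-b.example.com node-c.example.com; do
    ssh -o BatchMode=yes "$host" hostname || echo "password-less login to $host failed"
  done
]]></screen>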
</step>
<step>
<para>Since <code>node-b</code> will run a backup Master, repeat the procedure above,
substituting <code>node-b</code> everywhere you see <code>node-a</code>. Be sure not to
overwrite your existing <filename>.ssh/authorized_keys</filename> files, but concatenate
the new key onto the existing file using the <code>&gt;&gt;</code> operator rather than
the <code>&gt;</code> operator.</para>
</step>
</procedure>
<procedure>
<title>Prepare <code>node-a</code></title>
<para><code>node-a</code> will run your primary master and ZooKeeper processes, but no
RegionServers.</para>
<step>
<title>Stop the RegionServer from starting on <code>node-a</code>.</title>
<para>Edit <filename>conf/regionservers</filename> and remove the line which contains
<literal>localhost</literal>. Add lines with the hostnames or IP addresses for
<code>node-b</code> and <code>node-c</code>. Even if you did want to run a
RegionServer on <code>node-a</code>, you should refer to it by the hostname the other
servers would use to communicate with it. In this case, that would be
<literal>node-a.example.com</literal>. This enables you to distribute the
configuration to each node of your cluster without any hostname conflicts. Save the file.</para>
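<para>With the architecture used in this quickstart, the finished file would be expected to
list only the two RegionServer hosts, one per line. You can print it to check:</para>
<screen>
$ cat conf/regionservers
node-b.example.com
node-c.example.com
</screen>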
</step>
<step>
<title>Configure HBase to use <code>node-b</code> as a backup master.</title>
<para>Create a new file in <filename>conf/</filename> called
<filename>backup-masters</filename>, and add a new line to it with the hostname for
<code>node-b</code>. In this demonstration, the hostname is
<literal>node-b.example.com</literal>.</para>
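<para>For example, you could create the file from the HBase directory on
<code>node-a</code> as follows. Note that this sketch overwrites any existing
<filename>backup-masters</filename> file.</para>
<screen>$ echo "node-b.example.com" &gt; conf/backup-masters</screen>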
</step>
<step>
<title>Configure ZooKeeper</title>
<para>In reality, you should carefully consider your ZooKeeper configuration. You can find
out more about configuring ZooKeeper in <xref
linkend="zookeeper" />. This configuration will direct HBase to start and manage a
ZooKeeper instance on each node of the cluster.</para>
<para>On <code>node-a</code>, edit <filename>conf/hbase-site.xml</filename> and add the
following properties.</para>
<programlisting><![CDATA[
<property>
<name>hbase.zookeeper.quorum</name>
<value>node-a.example.com,node-b.example.com,node-c.example.com</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/usr/local/zookeeper</value>
</property>
]]></programlisting>
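<para>The properties need to sit inside the existing <code>&lt;configuration&gt;</code>
element. After editing, you may want to confirm that the file is still well-formed XML. The
check below assumes that <command>xmllint</command> (part of libxml2) is installed. It
prints nothing when the file is well-formed.</para>
<screen>$ xmllint --noout conf/hbase-site.xml</screen>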
</step>
<step>
<para>Everywhere in your configuration that you have referred to <code>node-a</code> as
<literal>localhost</literal>, change the reference to point to the hostname that
the other nodes will use to refer to <code>node-a</code>. In these examples, the
hostname is <literal>node-a.example.com</literal>.</para>
</step>
</procedure>
<procedure>
<title>Prepare <code>node-b</code> and <code>node-c</code></title>
<para><code>node-b</code> will run a backup master server, a ZooKeeper instance, and a
RegionServer. <code>node-c</code> will run a ZooKeeper instance and a RegionServer.</para>
<step>
<title>Download and unpack HBase.</title>
<para>Download and unpack HBase to <code>node-b</code> and <code>node-c</code>, just as
you did for the standalone and pseudo-distributed quickstarts.</para>
</step>
<step>
<title>Copy the configuration files from <code>node-a</code> to <code>node-b</code> and
<code>node-c</code>.</title>
<para>Each node of your cluster needs to have the same configuration information. Copy the
contents of the <filename>conf/</filename> directory to the <filename>conf/</filename>
directory on <code>node-b</code> and <code>node-c</code>.</para>
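<para>A minimal sketch of this copy, using <command>scp</command> from the HBase directory
on <code>node-a</code>, is shown below. It assumes that HBase is unpacked to the same path
on every node. The path used here is the one that appears in the sample output of this
quickstart and may differ on your system.</para>
<screen><![CDATA[
$ for host in node-b.example.com node-c.example.com; do
    scp conf/* "$host":/home/hbuser/hbase-0.98.3-hadoop2/conf/
  done
]]></screen>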
</step>
</procedure>
<procedure>
<title>Start and Test Your Cluster</title>
<step>
<title>Be sure HBase is not running on any node.</title>
<para>If you forgot to stop HBase from previous testing, you will have errors. Check to
see whether HBase is running on any of your nodes by using the <command>jps</command>
command. Look for the processes <literal>HMaster</literal>,
<literal>HRegionServer</literal>, and <literal>HQuorumPeer</literal>. If they exist,
kill them.</para>
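<para>If password-less SSH is already configured, a loop like the following sketch can
check all three nodes from <code>node-a</code>. It assumes that <command>jps</command> is
on the <envar>PATH</envar> for non-interactive SSH sessions. A host that prints no process
lines has no HBase daemons running.</para>
<screen><![CDATA[
$ for host in node-a.example.com node-b.example.com node-c.example.com; do
    echo "== $host"
    ssh "$host" jps | grep -E 'HMaster|HRegionServer|HQuorumPeer'
  done
]]></screen>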
</step>
<step>
<title>Start the cluster.</title>
<para>On <code>node-a</code>, issue the <command>start-hbase.sh</command> command. Your
output will be similar to that below.</para>
<screen>
$ <userinput>bin/start-hbase.sh</userinput>
node-c.example.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-c.example.com.out
node-a.example.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-a.example.com.out
node-b.example.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-b.example.com.out
starting master, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-master-node-a.example.com.out
node-c.example.com: starting regionserver, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-regionserver-node-c.example.com.out
node-b.example.com: starting regionserver, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-regionserver-node-b.example.com.out
node-b.example.com: starting master, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-master-node-b.example.com.out
</screen>
<para>ZooKeeper starts first, followed by the master, then the RegionServers, and finally
the backup masters. </para>
</step>
<step>
<title>Verify that the processes are running.</title>
<para>On each node of the cluster, run the <command>jps</command> command and verify that
the correct processes are running on each server. You may see additional Java processes
running on your servers as well, if they are used for other purposes.</para>
<example>
<title><code>node-a</code> <command>jps</command> Output</title>
<screen>
$ <userinput>jps</userinput>
20355 Jps
20071 HQuorumPeer
20137 HMaster
</screen>
</example>
<example>
<title><code>node-b</code> <command>jps</command> Output</title>
<screen>
$ <userinput>jps</userinput>
15930 HRegionServer
16194 Jps
15838 HQuorumPeer
16010 HMaster
</screen>
</example>
<example>
<title><code>node-c</code> <command>jps</command> Output</title>
<screen>
$ <userinput>jps</userinput>
13901 Jps
13639 HQuorumPeer
13737 HRegionServer
</screen>
</example>
<note>
<title>ZooKeeper Process Name</title>
<para>The <code>HQuorumPeer</code> process is a ZooKeeper instance which is controlled
and started by HBase. If you use ZooKeeper this way, it is limited to one instance per
cluster node, and is appropriate for testing only. If ZooKeeper is run outside of
HBase, the process is called <code>QuorumPeer</code>. For more about ZooKeeper
configuration, including using an external ZooKeeper instance with HBase, see <xref
linkend="zookeeper" />.</para>
</note>
</step>
<step>
<title>Browse to the Web UI.</title>
<note>
<title>Web UI Port Changes</title>
<para>In HBase newer than 0.98.x, the HTTP ports used by the HBase Web UI changed from
60010 for the Master and 60030 for each RegionServer to 16010 for the Master and 16030
for the RegionServer.</para>
</note>
<para>If everything is set up correctly, you should be able to connect to the UI for the
Master at <literal>http://node-a.example.com:60010/</literal>, or the secondary master at
<literal>http://node-b.example.com:60010/</literal>, using a
web browser. If you can connect via <code>localhost</code> but not from another host,
check your firewall rules. You can see the web UI for each of the RegionServers at port
60030 of their IP addresses, or by clicking their links in the web UI for the
Master.</para>
</step>
<step>
<title>Test what happens when nodes or services disappear.</title>
<para>With a three-node cluster like you have configured, things will not be very
resilient. Still, you can test what happens when the primary Master or a RegionServer
disappears, by killing the processes and watching the logs.</para>
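<para>One way to try this is sketched below. The log file pattern is an assumption based
on the <filename>.out</filename> file names in the sample start-up output, so adjust it to
match the files in your own <filename>logs/</filename> directory.</para>
<screen><![CDATA[
# On node-a: find the primary Master's PID with jps and kill it.
$ jps | awk '$2 == "HMaster" { print $1 }' | xargs kill -9

# On node-b, from the HBase directory: follow the backup Master's log
# to watch what it does after the primary disappears.
$ tail -f logs/hbase-*-master-*.log
]]></screen>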
</step>
</procedure>
</section>
<section>
<title>Where to go next</title>
<para>The standalone setup described above is good for testing and experiments only. The
next chapter, <xref
linkend="configuration" />, gives more information about the different HBase run modes,
system requirements for running HBase, and critical configuration areas for setting up a
distributed HBase cluster.</para>
</section>
</section>
</chapter>