HBASE-11399 Improve Quickstart chapter and move Pseudo-distributed and distributed into it (Misty Stanley-Jones)
This commit is contained in:
parent 20cac213af
commit 15831cefd5
@ -29,228 +29,319 @@
*/
-->
<title>Apache HBase Configuration</title>
|
||||
<para>This chapter is the Not-So-Quick start guide to Apache HBase configuration. It goes over
|
||||
system requirements, Hadoop setup, the different Apache HBase run modes, and the various
|
||||
configurations in HBase. Please read this chapter carefully. At a minimum ensure that all <xref
|
||||
linkend="basic.prerequisites" /> have been satisfied. Failure to do so will cause you (and us)
|
||||
grief debugging strange errors and/or data loss.</para>
|
||||
<para>This chapter expands upon the <xref linkend="getting_started" /> chapter to further explain
|
||||
configuration of Apache HBase. Please read this chapter carefully, especially <xref
|
||||
linkend="basic.prerequisites" /> to ensure that your HBase testing and deployment goes
|
||||
smoothly, and prevent data loss.</para>
|
||||
|
||||
<para> Apache HBase uses the same configuration system as Apache Hadoop. To configure a deploy,
|
||||
edit a file of environment variables in <filename>conf/hbase-env.sh</filename> -- this
|
||||
configuration is used mostly by the launcher shell scripts getting the cluster off the ground --
|
||||
and then add configuration to an XML file to do things like override HBase defaults, tell HBase
|
||||
what Filesystem to use, and the location of the ZooKeeper ensemble. <footnote>
|
||||
<para> Be careful editing XML. Make sure you close all elements. Run your file through
|
||||
<command>xmllint</command> or similar to ensure well-formedness of your document after an
|
||||
edit session. </para>
|
||||
</footnote></para>
|
||||
<para> Apache HBase uses the same configuration system as Apache Hadoop. All configuration files
|
||||
are located in the <filename>conf/</filename> directory, which needs to be kept in sync for each
|
||||
node on your cluster.</para>
|
||||
|
||||
<variablelist>
|
||||
<title>HBase Configuration Files</title>
|
||||
<varlistentry>
|
||||
<term><filename>backup-masters</filename></term>
|
||||
<listitem>
|
||||
<para>Not present by default. A plain-text file which lists hosts on which the Master should
|
||||
start a backup Master process, one host per line.</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term><filename>hadoop-metrics2-hbase.properties</filename></term>
|
||||
<listitem>
|
||||
<para>Used to connect HBase to Hadoop's Metrics2 framework. See the <link
|
||||
xlink:href="http://wiki.apache.org/hadoop/HADOOP-6728-MetricsV2">Hadoop Wiki
|
||||
entry</link> for more information on Metrics2. Contains only commented-out examples by
|
||||
default.</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term><filename>hbase-env.cmd</filename> and <filename>hbase-env.sh</filename></term>
|
||||
<listitem>
|
||||
<para>Script for Windows and Linux / Unix environments to set up the working environment for
|
||||
HBase, including the location of Java, Java options, and other environment variables. The
|
||||
file contains many commented-out examples to provide guidance.</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term><filename>hbase-policy.xml</filename></term>
|
||||
<listitem>
|
||||
<para>The default policy configuration file used by RPC servers to make authorization
|
||||
decisions on client requests. Only used if HBase security (<xref
|
||||
linkend="security" />) is enabled.</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term><filename>hbase-site.xml</filename></term>
|
||||
<listitem>
|
||||
<para>The main HBase configuration file. This file specifies configuration options which
|
||||
override HBase's default configuration. You can view (but do not edit) the default
|
||||
configuration file at <filename>docs/hbase-default.xml</filename>. You can also view the
|
||||
entire effective configuration for your cluster (defaults and overrides) in the
|
||||
<guilabel>HBase Configuration</guilabel> tab of the HBase Web UI.</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term><filename>log4j.properties</filename></term>
|
||||
<listitem>
|
||||
<para>Configuration file for HBase logging via <code>log4j</code>.</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry>
|
||||
<term><filename>regionservers</filename></term>
|
||||
<listitem>
|
||||
<para>A plain-text file containing a list of hosts which should run a RegionServer in your
|
||||
HBase cluster. By default this file contains the single entry
|
||||
<literal>localhost</literal>. It should contain a list of hostnames or IP addresses, one
|
||||
per line, and should only contain <literal>localhost</literal> if each node in your
|
||||
cluster will run a RegionServer on its <literal>localhost</literal> interface.</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
</variablelist>
|
||||
|
||||
<tip>
|
||||
<title>Checking XML Validity</title>
|
||||
<para>When you edit XML, it is a good idea to use an XML-aware editor to be sure that your
|
||||
syntax is correct and your XML is well-formed. You can also use the <command>xmllint</command>
|
||||
utility to check that your XML is well-formed. By default, <command>xmllint</command> re-flows
|
||||
and prints the XML to standard output. To check for well-formedness and only print output if
|
||||
errors exist, use the command <command>xmllint -noout
|
||||
<replaceable>filename.xml</replaceable></command>.</para>
|
||||
</tip>
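    <para>For example, assuming <command>xmllint</command> is installed and you are in the HBase
      installation directory, the following command checks the main configuration file and prints
      nothing if the file is well-formed:</para>
    <screen>$ xmllint -noout conf/hbase-site.xml</screen>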
|
||||
|
||||
<para>When running in distributed mode, after you make an edit to an HBase configuration, make
|
||||
sure you copy the content of the <filename>conf</filename> directory to all nodes of the
|
||||
cluster. HBase will not do this for you. Use <command>rsync</command>. For most configuration, a
|
||||
restart is needed for servers to pick up changes (caveat dynamic config. to be described later
|
||||
below).</para>
|
||||
<warning>
|
||||
<title>Keep Configuration In Sync Across the Cluster</title>
|
||||
<para>When running in distributed mode, after you make an edit to an HBase configuration, make
|
||||
sure you copy the content of the <filename>conf/</filename> directory to all nodes of the
|
||||
cluster. HBase will not do this for you. Use <command>rsync</command>, <command>scp</command>,
|
||||
or another secure mechanism for copying the configuration files to your nodes. For most
|
||||
configuration, a restart is needed for servers to pick up changes An exception is dynamic
|
||||
configuration. to be described later below.</para>
|
||||
</warning>
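    <para>As an illustration only (the install path below is an assumption, and the host name is
      borrowed from the fully-distributed example later in this guide), a configuration change
      could be pushed from the node where you edited it to another node with a command such
      as:</para>
    <screen>$ rsync -az /usr/local/hbase/conf/ node-b.example.com:/usr/local/hbase/conf/</screen>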
|
||||
|
||||
<section
|
||||
xml:id="basic.prerequisites">
|
||||
<title>Basic Prerequisites</title>
|
||||
<para>This section lists required services and some required system configuration. </para>
|
||||
|
||||
<section
|
||||
<table
|
||||
xml:id="java">
|
||||
<title>Java</title>
|
||||
<para>HBase requires at least Java 6 from <link
|
||||
xlink:href="http://www.java.com/download/">Oracle</link>. The following table lists which JDK version are
|
||||
compatible with each version of HBase.</para>
|
||||
<informaltable>
|
||||
<tgroup cols="4">
|
||||
<thead>
|
||||
<row>
|
||||
<entry>HBase Version</entry>
|
||||
<entry>JDK 6</entry>
|
||||
<entry>JDK 7</entry>
|
||||
<entry>JDK 8</entry>
|
||||
</row>
|
||||
</thead>
|
||||
<tbody>
|
||||
<row>
|
||||
<entry>1.0</entry>
|
||||
<entry><link xlink:href="http://search-hadoop.com/m/DHED4Zlz0R1">Not Supported</link></entry>
|
||||
<entry>yes</entry>
|
||||
<entry><para>Running with JDK 8 will work but is not well tested.</para></entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>0.98</entry>
|
||||
<entry>yes</entry>
|
||||
<entry>yes</entry>
|
||||
<entry><para>Running with JDK 8 works but is not well tested. Building with JDK 8
|
||||
would require removal of the deprecated remove() method of the PoolMap class and is
|
||||
under consideration. See <link
|
||||
xlink:href="https://issues.apache.org/jira/browse/HBASE-7608">HBASE-7608</link> for
|
||||
more information about JDK 8 support.</para></entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>0.96</entry>
|
||||
<entry>yes</entry>
|
||||
<entry>yes</entry>
|
||||
<entry></entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>0.94</entry>
|
||||
<entry>yes</entry>
|
||||
<entry>yes</entry>
|
||||
<entry></entry>
|
||||
</row>
|
||||
</tbody>
|
||||
</tgroup>
|
||||
</informaltable>
|
||||
</section>
|
||||
<textobject>
|
||||
<para>HBase requires at least Java 6 from <link
|
||||
xlink:href="http://www.java.com/download/">Oracle</link>. The following table lists
|
||||
which JDK versions are compatible with each version of HBase.</para>
|
||||
</textobject>
|
||||
<tgroup
|
||||
cols="4">
|
||||
<thead>
|
||||
<row>
|
||||
<entry>HBase Version</entry>
|
||||
<entry>JDK 6</entry>
|
||||
<entry>JDK 7</entry>
|
||||
<entry>JDK 8</entry>
|
||||
</row>
|
||||
</thead>
|
||||
<tbody>
|
||||
<row>
|
||||
<entry>1.0</entry>
|
||||
<entry><link
|
||||
xlink:href="http://search-hadoop.com/m/DHED4Zlz0R1">Not Supported</link></entry>
|
||||
<entry>yes</entry>
|
||||
<entry><para>Running with JDK 8 will work but is not well tested.</para></entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>0.98</entry>
|
||||
<entry>yes</entry>
|
||||
<entry>yes</entry>
|
||||
<entry><para>Running with JDK 8 works but is not well tested. Building with JDK 8 would
|
||||
require removal of the deprecated remove() method of the PoolMap class and is under
|
||||
consideration. See <link
|
||||
xlink:href="https://issues.apache.org/jira/browse/HBASE-7608">HBASE-7608</link>
|
||||
for more information about JDK 8 support.</para></entry>
|
||||
</row>
|
||||
<row>
|
||||
<entry>0.96</entry>
|
||||
<entry>yes</entry>
|
||||
<entry>yes</entry>
|
||||
<entry />
|
||||
</row>
|
||||
<row>
|
||||
<entry>0.94</entry>
|
||||
<entry>yes</entry>
|
||||
<entry>yes</entry>
|
||||
<entry />
|
||||
</row>
|
||||
</tbody>
|
||||
</tgroup>
|
||||
</table>
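  <para>To check which JDK a given node will use, run <command>java -version</command> as the user
    that runs HBase. The exact output varies by vendor and release; the following is only an
    indicative example for a JDK 7 installation:</para>
  <screen>$ java -version
java version "1.7.0_55"</screen>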
|
||||
|
||||
<section
|
||||
<variablelist
|
||||
xml:id="os">
|
||||
<title>Operating System</title>
|
||||
<section
|
||||
<title>Operating System Utilities</title>
|
||||
<varlistentry
|
||||
xml:id="ssh">
|
||||
<title>ssh</title>
|
||||
|
||||
<para><command>ssh</command> must be installed and <command>sshd</command> must be running
|
||||
to use Hadoop's scripts to manage remote Hadoop and HBase daemons. You must be able to ssh
|
||||
to all nodes, including your local node, using passwordless login (Google "ssh
|
||||
passwordless login"). If on mac osx, see the section, <link
|
||||
xlink:href="http://wiki.apache.org/hadoop/Running_Hadoop_On_OS_X_10.5_64-bit_%28Single-Node_Cluster%29">SSH:
|
||||
Setting up Remote Desktop and Enabling Self-Login</link> on the hadoop wiki.</para>
|
||||
</section>
|
||||
|
||||
<section
|
||||
<term>ssh</term>
|
||||
<listitem>
|
||||
<para>HBase uses the Secure Shell (ssh) command and utilities extensively to communicate
|
||||
between cluster nodes. Each server in the cluster must be running <command>ssh</command>
|
||||
so that the Hadoop and HBase daemons can be managed. You must be able to connect to all
|
||||
nodes via SSH, including the local node, from the Master as well as any backup Master,
|
||||
using a shared key rather than a password. You can see the basic methodology for such a
|
||||
set-up in Linux or Unix systems at <xref
|
||||
linkend="passwordless.ssh.quickstart" />. If your cluster nodes use OS X, see the
|
||||
section, <link
|
||||
xlink:href="http://wiki.apache.org/hadoop/Running_Hadoop_On_OS_X_10.5_64-bit_%28Single-Node_Cluster%29">SSH:
|
||||
Setting up Remote Desktop and Enabling Self-Login</link> on the Hadoop wiki.</para>
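          <para>A minimal sketch of such a set-up, assuming OpenSSH and a user named
            <systemitem>hbase</systemitem> (substitute whichever user actually runs HBase), run
            from the Master node, might look like the following:</para>
          <screen>$ ssh-keygen -t rsa                      # generate a key pair; leave the passphrase empty
$ ssh-copy-id hbase@node-b.example.com    # append the public key to the remote authorized_keys
$ ssh hbase@node-b.example.com            # should now log in without prompting for a password</screen>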
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry
|
||||
xml:id="dns">
|
||||
<title>DNS</title>
|
||||
<term>DNS</term>
|
||||
<listitem>
|
||||
<para>HBase uses the local hostname to self-report its IP address. Both forward and
|
||||
reverse DNS resolving must work in versions of HBase previous to 0.92.0.<footnote>
|
||||
<para>The <link
|
||||
xlink:href="https://github.com/sujee/hadoop-dns-checker">hadoop-dns-checker</link>
|
||||
tool can be used to verify DNS is working correctly on the cluster. The project
|
||||
README file provides detailed instructions on usage. </para>
|
||||
</footnote></para>
|
||||
|
||||
<para>HBase uses the local hostname to self-report its IP address. Both forward and reverse
|
||||
DNS resolving must work in versions of HBase previous to 0.92.0 <footnote>
|
||||
<para>The <link
|
||||
xlink:href="https://github.com/sujee/hadoop-dns-checker">hadoop-dns-checker</link>
|
||||
tool can be used to verify DNS is working correctly on the cluster. The project README
|
||||
file provides detailed instructions on usage. </para>
|
||||
</footnote>.</para>
|
||||
<para>If your server has multiple network interfaces, HBase defaults to using the
|
||||
interface that the primary hostname resolves to. To override this behavior, set the
|
||||
<code>hbase.regionserver.dns.interface</code> property to a different interface. This
|
||||
will only work if each server in your cluster uses the same network interface
|
||||
configuration.</para>
|
||||
|
||||
<para>If your machine has multiple interfaces, HBase will use the interface that the primary
|
||||
hostname resolves to.</para>
|
||||
|
||||
<para>If this is insufficient, you can set
|
||||
<varname>hbase.regionserver.dns.interface</varname> to indicate the primary interface.
|
||||
This only works if your cluster configuration is consistent and every host has the same
|
||||
network interface configuration.</para>
|
||||
|
||||
<para>Another alternative is setting <varname>hbase.regionserver.dns.nameserver</varname> to
|
||||
choose a different nameserver than the system wide default.</para>
|
||||
</section>
|
||||
<section
|
||||
<para>To choose a different DNS nameserver than the system default, set the
|
||||
<varname>hbase.regionserver.dns.nameserver</varname> property to the IP address of
|
||||
that nameserver.</para>
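          <para>As an illustrative sketch, the following <filename>hbase-site.xml</filename>
            properties show both overrides described above. The interface name
            <literal>eth1</literal> and the nameserver address are assumptions about your
            environment, not defaults:</para>
          <programlisting><![CDATA[<property>
  <name>hbase.regionserver.dns.interface</name>
  <value>eth1</value>
</property>
<property>
  <name>hbase.regionserver.dns.nameserver</name>
  <value>192.168.1.53</value>
</property>]]></programlisting>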
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry
|
||||
xml:id="loopback.ip">
|
||||
<title>Loopback IP</title>
|
||||
<para>Previous to hbase-0.96.0, HBase expects the loopback IP address to be 127.0.0.1. See <xref
|
||||
linkend="loopback.ip" /></para>
|
||||
</section>
|
||||
|
||||
<section
|
||||
<term>Loopback IP</term>
|
||||
<listitem>
|
||||
<para>Prior to hbase-0.96.0, HBase only used the IP address
|
||||
<systemitem>127.0.0.1</systemitem> to refer to <code>localhost</code>, and this could
|
||||
not be configured. See <xref
|
||||
linkend="loopback.ip" />.</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
<varlistentry
|
||||
xml:id="ntp">
|
||||
<title>NTP</title>
|
||||
<term>NTP</term>
|
||||
<listitem>
|
||||
<para>The clocks on cluster nodes should be synchronized. A small amount of variation is
|
||||
acceptable, but larger amounts of skew can cause erratic and unexpected behavior. Time
|
||||
synchronization is one of the first things to check if you see unexplained problems in
|
||||
your cluster. It is recommended that you run a Network Time Protocol (NTP) service, or
|
||||
another time-synchronization mechanism, on your cluster, and that all nodes look to the
|
||||
same service for time synchronization. See the <link
|
||||
xlink:href="http://www.tldp.org/LDP/sag/html/basic-ntp-config.html">Basic NTP
|
||||
Configuration</link> at <citetitle>The Linux Documentation Project (TLDP)</citetitle>
|
||||
to set up NTP.</para>
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<para>The clocks on cluster members should be in basic alignment. Some skew is tolerable
|
||||
but wild skew could generate odd behaviors. Run <link
|
||||
xlink:href="http://en.wikipedia.org/wiki/Network_Time_Protocol">NTP</link> on your
|
||||
cluster, or an equivalent.</para>
|
||||
|
||||
<para>If you are having problems querying data, or "weird" cluster operations, check system
|
||||
time!</para>
|
||||
</section>
|
||||
|
||||
<section
|
||||
<varlistentry
|
||||
xml:id="ulimit">
|
||||
<title>
|
||||
<varname>ulimit</varname><indexterm>
|
||||
<term>Limits on Number of Files and Processes (<command>ulimit</command>)
|
||||
<indexterm>
|
||||
<primary>ulimit</primary>
|
||||
</indexterm> and <varname>nproc</varname><indexterm>
|
||||
</indexterm><indexterm>
|
||||
<primary>nproc</primary>
|
||||
</indexterm>
|
||||
</title>
|
||||
</term>
|
||||
|
||||
<para>Apache HBase is a database. It uses a lot of files all at the same time. The default
|
||||
ulimit -n -- i.e. user file limit -- of 1024 on most *nix systems is insufficient (on Mac
OS X it is 256). Any significant amount of loading will lead you to <xref
|
||||
linkend="trouble.rs.runtime.filehandles" />. You may also notice errors such as the
|
||||
following:</para>
|
||||
<screen>
|
||||
<listitem>
|
||||
<para>Apache HBase is a database. It requires the ability to open a large number of files
|
||||
at once. Many Linux distributions limit the number of files a single user is allowed to
|
||||
open to <literal>1024</literal> (or <literal>256</literal> on older versions of OS X).
|
||||
You can check this limit on your servers by running the command <command>ulimit
|
||||
-n</command> when logged in as the user which runs HBase. See <xref
|
||||
linkend="trouble.rs.runtime.filehandles" /> for some of the problems you may
|
||||
experience if the limit is too low. You may also notice errors such as the
|
||||
following:</para>
|
||||
<screen>
|
||||
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception increateBlockOutputStream java.io.EOFException
|
||||
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901
|
||||
</screen>
|
||||
<para> Do yourself a favor and change the upper bound on the number of file descriptors. Set
|
||||
it to north of 10k. The math runs roughly as follows: per ColumnFamily there is at least
|
||||
one StoreFile and possibly up to 5 or 6 if the region is under load. Multiply the average
|
||||
number of StoreFiles per ColumnFamily times the number of regions per RegionServer. For
|
||||
example, assuming that a schema had 3 ColumnFamilies per region with an average of 3
|
||||
StoreFiles per ColumnFamily, and there are 100 regions per RegionServer, the JVM will open
|
||||
3 * 3 * 100 = 900 file descriptors (not counting open jar files, config files, etc.) </para>
|
||||
<para>You should also up the hbase users' <varname>nproc</varname> setting; under load, a
|
||||
low-nproc setting could manifest as <classname>OutOfMemoryError</classname>. <footnote>
|
||||
<para>See Jack Levin's <link
|
||||
xlink:href="">major hdfs issues</link> note up on the user list.</para>
|
||||
</footnote>
|
||||
<footnote>
|
||||
<para>The requirement that a database requires upping of system limits is not peculiar
|
||||
to Apache HBase. See for example the section <emphasis>Setting Shell Limits for the
|
||||
Oracle User</emphasis> in <link
|
||||
xlink:href="http://www.akadia.com/services/ora_linux_install_10g.html"> Short Guide
|
||||
to install Oracle 10 on Linux</link>.</para>
|
||||
</footnote></para>
|
||||
</screen>
|
||||
<para>It is recommended to raise the ulimit to at least 10,000, but more likely 10,240,
|
||||
because the value is usually expressed in multiples of 1024. Each ColumnFamily has at
|
||||
least one StoreFile, and possibly more than 6 StoreFiles if the region is under load.
|
||||
The number of open files required depends upon the number of ColumnFamilies and the
|
||||
number of regions. The following is a rough formula for calculating the potential number
|
||||
of open files on a RegionServer. </para>
|
||||
<example>
|
||||
<title>Calculate the Potential Number of Open Files</title>
|
||||
<screen>(StoreFiles per ColumnFamily) x (regions per RegionServer)</screen>
|
||||
</example>
|
||||
<para>For example, assuming that a schema had 3 ColumnFamilies per region with an average
|
||||
of 3 StoreFiles per ColumnFamily, and there are 100 regions per RegionServer, the JVM
|
||||
will open 3 * 3 * 100 = 900 file descriptors, not counting open JAR files, configuration
|
||||
files, and others. Opening a file does not take many resources, and the risk of allowing
|
||||
a user to open too many files is minimal.</para>
|
||||
<para>Another related setting is the number of processes a user is allowed to run at once.
|
||||
In Linux and Unix, the number of processes is set using the <command>ulimit -u</command>
|
||||
command. This should not be confused with the <command>nproc</command> command, which
reports the number of processing units available, not a per-user limit. Under load, a
|
||||
<varname>nproc</varname> that is too low can cause OutOfMemoryError exceptions. See
|
||||
Jack Levin's <link
|
||||
xlink:href="http://thread.gmane.org/gmane.comp.java.hadoop.hbase.user/16374">major
|
||||
hdfs issues</link> thread on the hbase-users mailing list, from 2011.</para>
|
||||
<para>Configuring the maximum number of file descriptors and processes for the user who is
|
||||
running the HBase process is an operating system configuration, rather than an HBase
|
||||
configuration. It is also important to be sure that the settings are changed for the
|
||||
user that actually runs HBase. To see which user started HBase, and that user's ulimit
|
||||
configuration, look at the first line of the HBase log for that instance.<footnote>
|
||||
<para>A useful read on setting configuration on your Hadoop cluster is Aaron Kimball's <link
|
||||
xlink:href="http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/">Configuration
|
||||
Parameters: What can you just ignore?</link></para>
|
||||
</footnote></para>
|
||||
<formalpara xml:id="ulimit_ubuntu">
|
||||
<title><command>ulimit</command> Settings on Ubuntu</title>
|
||||
<para>To configure <command>ulimit</command> settings on Ubuntu, edit
|
||||
<filename>/etc/security/limits.conf</filename>, which is a space-delimited file with
|
||||
four columns. Refer to the <link
|
||||
xlink:href="http://manpages.ubuntu.com/manpages/lucid/man5/limits.conf.5.html">man
|
||||
page for limits.conf</link> for details about the format of this file. In the
|
||||
following example, the first line sets both soft and hard limits for the number of
|
||||
open files (<literal>nofile</literal>) to <literal>32768</literal> for the operating
|
||||
system user with the username <literal>hadoop</literal>. The second line sets the
|
||||
number of processes to 32000 for the same user.</para>
|
||||
</formalpara>
|
||||
<screen>
|
||||
hadoop - nofile 32768
|
||||
hadoop - nproc 32000
|
||||
</screen>
|
||||
<para>The settings are only applied if the Pluggable Authentication Module (PAM)
|
||||
environment is directed to use them. To configure PAM to use these limits, be sure that
|
||||
the <filename>/etc/pam.d/common-session</filename> file contains the following line:</para>
|
||||
<screen>session required pam_limits.so</screen>
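          <para>After logging out and back in as the user that runs HBase, you can confirm that the
            new limits are in effect. The values shown below are the ones from the example above;
            the exact report format depends on your shell:</para>
          <screen>$ ulimit -n -u
open files                      (-n) 32768
max user processes              (-u) 32000</screen>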
|
||||
</listitem>
|
||||
</varlistentry>
|
||||
|
||||
<para>To be clear, upping the file descriptors and nproc for the user who is running the
|
||||
HBase process is an operating system configuration, not an HBase configuration. Also, a
|
||||
common mistake is that administrators will up the file descriptors for a particular user
|
||||
but for whatever reason, HBase will be running as someone else. HBase prints in its logs
as the first line the ulimit it is seeing. Ensure it is correct. <footnote>
|
||||
<para>A useful read on setting configuration on your Hadoop cluster is Aaron Kimball's <link
|
||||
xlink:href="http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/">Configuration
|
||||
Parameters: What can you just ignore?</link></para>
|
||||
</footnote></para>
|
||||
|
||||
<section
|
||||
xml:id="ulimit_ubuntu">
|
||||
<title><varname>ulimit</varname> on Ubuntu</title>
|
||||
|
||||
<para>If you are on Ubuntu you will need to make the following changes:</para>
|
||||
|
||||
<para>In the file <filename>/etc/security/limits.conf</filename> add a line like:</para>
|
||||
<programlisting>hadoop - nofile 32768</programlisting>
|
||||
<para>Replace <varname>hadoop</varname> with whatever user is running Hadoop and HBase. If
|
||||
you have separate users, you will need 2 entries, one for each user. In the same file
|
||||
set nproc hard and soft limits. For example:</para>
|
||||
<programlisting>hadoop soft/hard nproc 32000</programlisting>
|
||||
<para>In the file <filename>/etc/pam.d/common-session</filename> add as the last line in
|
||||
the file: <programlisting>session required pam_limits.so</programlisting> Otherwise the
|
||||
changes in <filename>/etc/security/limits.conf</filename> won't be applied.</para>
|
||||
|
||||
<para>Don't forget to log out and back in again for the changes to take effect!</para>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section
|
||||
<varlistentry
|
||||
xml:id="windows">
|
||||
<title>Windows</title>
|
||||
<term>Windows</term>
|
||||
|
||||
<para>Previous to hbase-0.96.0, Apache HBase was little tested running on Windows. Running a
|
||||
production install of HBase on top of Windows is not recommended.</para>
|
||||
<listitem>
|
||||
<para>Prior to HBase 0.96, testing for running HBase on Microsoft Windows was limited.
|
||||
Running on Windows nodes is not recommended for production systems.</para>
|
||||
|
||||
<para>If you are running HBase on Windows pre-hbase-0.96.0, you must install <link
|
||||
xlink:href="http://cygwin.com/">Cygwin</link> to have a *nix-like environment for the
|
||||
shell scripts. The full details are explained in the <link
|
||||
<para>To run versions of HBase prior to 0.96 on Microsoft Windows, you must install <link
|
||||
xlink:href="http://cygwin.com/">Cygwin</link> and run HBase within the Cygwin
|
||||
environment. This provides support for Linux/Unix commands and scripts. The full details are explained in the <link
|
||||
xlink:href="http://hbase.apache.org/cygwin.html">Windows Installation</link> guide. Also <link
|
||||
xlink:href="http://search-hadoop.com/?q=hbase+windows&fc_project=HBase&fc_type=mail+_hash_+dev">search
|
||||
our user mailing list</link> to pick up the latest fixes figured out by Windows users.</para>
|
||||
<para>Post-hbase-0.96.0, HBase runs natively on Windows with supporting
|
||||
<command>*.cmd</command> scripts bundled. </para>
|
||||
</section>
|
||||
<command>*.cmd</command> scripts bundled. </para></listitem>
|
||||
</varlistentry>
|
||||
|
||||
</section>
|
||||
</variablelist>
|
||||
<!-- OS -->
|
||||
|
||||
<section
|
||||
|
@ -259,17 +350,18 @@
|
|||
xlink:href="http://hadoop.apache.org">Hadoop</link><indexterm>
|
||||
<primary>Hadoop</primary>
|
||||
</indexterm></title>
|
||||
<para>The below table shows some information about what versions of Hadoop are supported by
|
||||
various HBase versions. Based on the version of HBase, you should select the most
|
||||
appropriate version of Hadoop. We are not in the Hadoop distro selection business. You can
|
||||
use Hadoop distributions from Apache, or learn about vendor distributions of Hadoop at <link
|
||||
xlink:href="http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support" /></para>
|
||||
<para>The following table summarizes the versions of Hadoop supported with each version of
|
||||
HBase. Based on the version of HBase, you should select the most
|
||||
appropriate version of Hadoop. You can use Apache Hadoop, or a vendor's distribution of
|
||||
Hadoop. No distinction is made here. See <link
|
||||
xlink:href="http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support" />
|
||||
for information about vendors of Hadoop.</para>
|
||||
<tip>
|
||||
<title>Hadoop 2.x is better than Hadoop 1.x</title>
|
||||
<para>Hadoop 2.x is faster, with more features such as short-circuit reads which will help
|
||||
improve your HBase random read profile, as well as important bug fixes that will improve your
|
||||
overall HBase experience. You should run Hadoop 2 rather than Hadoop 1. HBase 0.98
|
||||
deprecates use of Hadoop1. HBase 1.0 will not support Hadoop1. </para>
|
||||
<title>Hadoop 2.x is recommended.</title>
|
||||
<para>Hadoop 2.x is faster and includes features, such as short-circuit reads, which will
|
||||
help improve your HBase random read profile. Hadoop 2.x also includes important bug fixes
|
||||
that will improve your overall HBase experience. HBase 0.98 deprecates use of Hadoop 1.x,
|
||||
and HBase 1.0 will not support Hadoop 1.x.</para>
|
||||
</tip>
|
||||
<para>Use the following legend to interpret this table:</para>
|
||||
<simplelist
|
||||
|
@ -618,7 +710,9 @@ Index: pom.xml
|
|||
instance of the <emphasis>Hadoop Distributed File System</emphasis> (HDFS).
|
||||
Fully-distributed mode can ONLY run on HDFS. See the Hadoop <link
|
||||
xlink:href="http://hadoop.apache.org/common/docs/r1.1.1/api/overview-summary.html#overview_description">
|
||||
requirements and instructions</link> for how to set up HDFS.</para>
|
||||
requirements and instructions</link> for how to set up HDFS for Hadoop 1.x. A good
|
||||
walk-through for setting up HDFS on Hadoop 2 is at <link
|
||||
xlink:href="http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide">http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide</link>.</para>
|
||||
|
||||
<para>Below we describe the different distributed setups. Starting, verification and
|
||||
exploration of your install, whether a <emphasis>pseudo-distributed</emphasis> or
|
||||
|
@ -628,207 +722,139 @@ Index: pom.xml
|
|||
<section
|
||||
xml:id="pseudo">
|
||||
<title>Pseudo-distributed</title>
|
||||
<note>
|
||||
<title>Pseudo-Distributed Quickstart</title>
|
||||
<para>A quickstart has been added to the <xref
|
||||
linkend="quickstart" /> chapter. See <xref
|
||||
linkend="quickstart-pseudo" />. Some of the information that was originally in this
|
||||
section has been moved there.</para>
|
||||
</note>
|
||||
|
||||
<para>A pseudo-distributed mode is simply a fully-distributed mode run on a single host. Use
|
||||
this configuration for testing and prototyping on HBase. Do not use this configuration for
|
||||
production nor for evaluating HBase performance.</para>
|
||||
|
||||
<para>First, if you want to run on HDFS rather than on the local filesystem, setup your
|
||||
HDFS. You can set up HDFS also in pseudo-distributed mode (TODO: Add pointer to HOWTO doc;
|
||||
the Hadoop site doesn't have any anymore). Ensure you have a working HDFS before
|
||||
proceeding. </para>
|
||||
|
||||
<para>Next, configure HBase. Edit <filename>conf/hbase-site.xml</filename>. This is the file
|
||||
into which you add local customizations and overrides. At a minimum, you must tell HBase
|
||||
to run in (pseudo-)distributed mode rather than in default standalone mode. To do this,
|
||||
set the <varname>hbase.cluster.distributed</varname> property to true (Its default is
|
||||
<varname>false</varname>). The absolute bare-minimum <filename>hbase-site.xml</filename>
|
||||
is therefore as follows:</para>
|
||||
<programlisting><![CDATA[
|
||||
<configuration>
|
||||
<property>
|
||||
<name>hbase.cluster.distributed</name>
|
||||
<value>true</value>
|
||||
</property>
|
||||
</configuration>
|
||||
]]>
|
||||
</programlisting>
|
||||
<para>With this configuration, HBase will start up an HBase Master process, a ZooKeeper
|
||||
server, and a RegionServer process running against the local filesystem writing to
|
||||
wherever your operating system stores temporary files into a directory named
|
||||
<filename>hbase-YOUR_USER_NAME</filename>.</para>
|
||||
|
||||
<para>Such a setup, using the local filesystem and writing to the operating system's
temporary directory, is an ephemeral setup; the Hadoop local filesystem -- which is what
HBase uses when it is writing to the local filesystem -- would lose data unless the system
|
||||
was shutdown properly in versions of HBase before 0.98.4 and 1.0.0 (see
|
||||
<link xlink:href="https://issues.apache.org/jira/browse/HBASE-11218">HBASE-11218 Data
|
||||
loss in HBase standalone mode</link>). Writing to the operating
|
||||
system's temporary directory can also make for data loss when the machine is restarted as
|
||||
this directory is usually cleared on reboot. For a more permanent setup, see the next
|
||||
example where we make use of an instance of HDFS; HBase data will be written to the Hadoop
|
||||
distributed filesystem rather than to the local filesystem's tmp directory.</para>
|
||||
<para>In this <filename>conf/hbase-site.xml</filename> example, the
|
||||
<varname>hbase.rootdir</varname> property points to the local HDFS instance homed on the
|
||||
node <varname>h-24-30.example.com</varname>.</para>
|
||||
<note>
|
||||
<title>Let HBase create <filename>${hbase.rootdir}</filename></title>
|
||||
<para>Let HBase create the <varname>hbase.rootdir</varname> directory. If you don't,
|
||||
you'll get a warning saying HBase needs a migration run because the directory is missing
|
||||
files expected by HBase (it'll create them if you let it).</para>
|
||||
</note>
|
||||
<programlisting>
|
||||
<configuration>
|
||||
<property>
|
||||
<name>hbase.rootdir</name>
|
||||
<value>hdfs://h-24-30.example.com:8020/hbase</value>
|
||||
</property>
|
||||
<property>
|
||||
<name>hbase.cluster.distributed</name>
|
||||
<value>true</value>
|
||||
</property>
|
||||
</configuration>
|
||||
</programlisting>
|
||||
|
||||
<para>Now skip to <xref
|
||||
linkend="confirm" /> for how to start and verify your pseudo-distributed install. <footnote>
|
||||
<para>See <xref
|
||||
linkend="pseudo.extras" /> for notes on how to start extra Masters and RegionServers
|
||||
when running pseudo-distributed.</para>
|
||||
</footnote></para>
|
||||
|
||||
<section
|
||||
xml:id="pseudo.extras">
|
||||
<title>Pseudo-distributed Extras</title>
|
||||
|
||||
<section
|
||||
xml:id="pseudo.extras.start">
|
||||
<title>Startup</title>
|
||||
<para>To start up the initial HBase cluster...</para>
|
||||
<screen>% bin/start-hbase.sh</screen>
|
||||
<para>To start up an extra backup master(s) on the same server run...</para>
|
||||
<screen>% bin/local-master-backup.sh start 1</screen>
|
||||
<para>... the '1' means use ports 16001 & 16011, and this backup master's logfile
|
||||
will be at <filename>logs/hbase-${USER}-1-master-${HOSTNAME}.log</filename>. </para>
|
||||
<para>To startup multiple backup masters run...</para>
|
||||
<screen>% bin/local-master-backup.sh start 2 3</screen>
|
||||
<para>You can start up to 9 backup masters (10 total). </para>
|
||||
<para>To start up more regionservers...</para>
|
||||
<screen>% bin/local-regionservers.sh start 1</screen>
|
||||
<para>... where '1' means use ports 16201 & 16301 and its logfile will be at
|
||||
<filename>logs/hbase-${USER}-1-regionserver-${HOSTNAME}.log</filename>. </para>
|
||||
<para>To add 4 more regionservers in addition to the one you just started,
run...</para>
|
||||
<screen>% bin/local-regionservers.sh start 2 3 4 5</screen>
|
||||
<para>This supports up to 99 extra regionservers (100 total). </para>
|
||||
</section>
|
||||
<section
|
||||
xml:id="pseudo.options.stop">
|
||||
<title>Stop</title>
|
||||
<para>Assuming you want to stop master backup # 1, run...</para>
|
||||
<screen>% cat /tmp/hbase-${USER}-1-master.pid |xargs kill -9</screen>
|
||||
<para>Note that bin/local-master-backup.sh stop 1 will try to stop the cluster along
|
||||
with the master. </para>
|
||||
<para>To stop an individual regionserver, run...</para>
|
||||
<screen>% bin/local-regionservers.sh stop 1</screen>
|
||||
</section>
|
||||
|
||||
</section>
|
||||
|
||||
</section>
|
||||
|
||||
</section>
|
||||
|
||||
<section
|
||||
xml:id="fully_dist">
|
||||
<title>Fully-distributed</title>
|
||||
<para>By default, HBase runs in standalone mode. Both standalone mode and pseudo-distributed
|
||||
mode are provided for the purposes of small-scale testing. For a production environment,
|
||||
distributed mode is appropriate. In distributed mode, multiple instances of HBase daemons
|
||||
run on multiple servers in the cluster.</para>
|
||||
<para>Just as in pseudo-distributed mode, a fully distributed configuration requires that you
|
||||
set the <code>hbase.cluster.distributed</code> property to <literal>true</literal>.
|
||||
Typically, the <code>hbase.rootdir</code> is configured to point to a highly-available HDFS
|
||||
filesystem. </para>
|
||||
<para>In addition, the cluster is configured so that multiple cluster nodes enlist as
|
||||
RegionServers, ZooKeeper QuorumPeers, and backup HMaster servers. These configuration basics
|
||||
are all demonstrated in <xref
|
||||
linkend="quickstart-fully-distributed" />.</para>
|
||||
|
||||
<formalpara
|
||||
xml:id="regionserver">
|
||||
<title>Distributed RegionServers</title>
|
||||
<para>Typically, your cluster will contain multiple RegionServers all running on different
|
||||
servers, as well as primary and backup Master and Zookeeper daemons. The
|
||||
<filename>conf/regionservers</filename> file on the master server contains a list of
|
||||
hosts whose RegionServers are associated with this cluster. Each host is on a separate
|
||||
line. All hosts listed in this file will have their RegionServer processes started and
|
||||
stopped when the master server starts or stops.</para>
|
||||
</formalpara>
|
||||
|
||||
<formalpara
|
||||
xml:id="hbase.zookeeper">
|
||||
<title>ZooKeeper and HBase</title>
|
||||
<para>See section <xref
|
||||
linkend="zookeeper" /> for ZooKeeper setup for HBase.</para>
|
||||
</formalpara>
|
||||
|
||||
<section
|
||||
xml:id="fully_dist">
|
||||
<title>Fully-distributed</title>
|
||||
|
||||
<para>For running a fully-distributed operation on more than one host, make the following
|
||||
configurations. In <filename>hbase-site.xml</filename>, add the property
|
||||
<varname>hbase.cluster.distributed</varname> and set it to <varname>true</varname> and
|
||||
point the HBase <varname>hbase.rootdir</varname> at the appropriate HDFS NameNode and
|
||||
location in HDFS where you would like HBase to write data. For example, if you namenode
|
||||
were running at namenode.example.org on port 8020 and you wanted to home your HBase in
|
||||
HDFS at <filename>/hbase</filename>, make the following configuration.</para>
|
||||
|
||||
<example>
|
||||
<title>Example Distributed HBase Cluster</title>
|
||||
<para>This is a bare-bones <filename>conf/hbase-site.xml</filename> for a distributed HBase
|
||||
cluster. A cluster that is used for real-world work would contain more custom
|
||||
configuration parameters. Most HBase configuration directives have default values, which
|
||||
are used unless the value is overridden in the <filename>hbase-site.xml</filename>. See <xref
|
||||
linkend="config.files" /> for more information.</para>
|
||||
<programlisting><![CDATA[
|
||||
<configuration>
|
||||
...
|
||||
<property>
|
||||
<name>hbase.rootdir</name>
|
||||
<value>hdfs://namenode.example.org:8020/hbase</value>
|
||||
<description>The directory shared by RegionServers.
|
||||
</description>
|
||||
</property>
|
||||
<property>
|
||||
<name>hbase.cluster.distributed</name>
|
||||
<value>true</value>
|
||||
<description>The mode the cluster will be in. Possible values are
|
||||
false: standalone and pseudo-distributed setups with managed Zookeeper
|
||||
true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
|
||||
</description>
|
||||
</property>
|
||||
...
|
||||
<property>
|
||||
<name>hbase.zookeeper.quorum</name>
|
||||
<value>node-a.example.com,node-b.example.com,node-c.example.com</value>
|
||||
</property>
|
||||
</configuration>
|
||||
]]>
|
||||
</programlisting>
|
||||
<para>This is an example <filename>conf/regionservers</filename> file, which contains a list
|
||||
of each node that should run a RegionServer in the cluster. These nodes need HBase
|
||||
installed and they need to use the same contents of the <filename>conf/</filename>
|
||||
directory as the Master server.</para>
|
||||
<programlisting>
|
||||
node-a.example.com
|
||||
node-b.example.com
|
||||
node-c.example.com
|
||||
</programlisting>
|
||||
<para>This is an example <filename>conf/backup-masters</filename> file, which contains a
|
||||
list of each node that should run a backup Master instance. The backup Master instances
|
||||
will sit idle unless the main Master becomes unavailable.</para>
|
||||
<programlisting>
|
||||
node-b.example.com
|
||||
node-c.example.com
|
||||
</programlisting>
|
||||
</example>
|
||||
<formalpara>
|
||||
<title>Distributed HBase Quickstart</title>
|
||||
<para>See <xref
|
||||
linkend="quickstart-fully-distributed" /> for a walk-through of a simple three-node
|
||||
cluster configuration with multiple ZooKeeper, backup HMaster, and RegionServer
|
||||
instances.</para>
|
||||
</formalpara>
|
||||
|
||||
<section
|
||||
xml:id="regionserver">
|
||||
<title><filename>regionservers</filename></title>
|
||||
|
||||
<para>In addition, a fully-distributed mode requires that you modify
|
||||
<filename>conf/regionservers</filename>. The <xref
|
||||
linkend="regionservers" /> file lists all hosts that you would have running
|
||||
<application>HRegionServer</application>s, one host per line (This file in HBase is
|
||||
like the Hadoop <filename>slaves</filename> file). All servers listed in this file will
|
||||
be started and stopped when the HBase cluster start or stop scripts are run.</para>
|
||||
</section>
|
||||
|
||||
<section
|
||||
xml:id="hbase.zookeeper">
|
||||
<title>ZooKeeper and HBase</title>
|
||||
<para>See section <xref
|
||||
linkend="zookeeper" /> for ZooKeeper setup for HBase.</para>
|
||||
</section>
|
||||
|
||||
<section
|
||||
xml:id="hdfs_client_conf">
|
||||
<title>HDFS Client Configuration</title>
|
||||
|
||||
<para>Of note, if you have made <emphasis>HDFS client configuration</emphasis> on your
|
||||
Hadoop cluster -- i.e. configuration you want HDFS clients to use as opposed to
|
||||
server-side configurations -- HBase will not see this configuration unless you do one of
|
||||
the following:</para>
|
||||
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
<procedure
|
||||
xml:id="hdfs_client_conf">
|
||||
<title>HDFS Client Configuration</title>
|
||||
<step>
|
||||
<para>Of note, if you have made HDFS client configuration on your Hadoop cluster, such as
|
||||
configuration directives for HDFS clients, as opposed to server-side configurations, you
|
||||
must use one of the following methods to enable HBase to see and use these configuration
|
||||
changes:</para>
|
||||
<stepalternatives>
|
||||
<step>
|
||||
<para>Add a pointer to your <varname>HADOOP_CONF_DIR</varname> to the
|
||||
<varname>HBASE_CLASSPATH</varname> environment variable in
|
||||
<filename>hbase-env.sh</filename>.</para>
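              <para>For example, assuming your Hadoop configuration lives in
                <filename>/etc/hadoop/conf</filename> (the path is an assumption, not an HBase
                default), the following line could be added to
                <filename>conf/hbase-env.sh</filename>:</para>
              <programlisting>export HBASE_CLASSPATH=${HBASE_CLASSPATH}:/etc/hadoop/conf</programlisting>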
|
||||
</listitem>
|
||||
</step>
|
||||
|
||||
<listitem>
|
||||
<step>
|
||||
<para>Add a copy of <filename>hdfs-site.xml</filename> (or
|
||||
<filename>hadoop-site.xml</filename>) or, better, symlinks, under
|
||||
<filename>${HBASE_HOME}/conf</filename>, or</para>
|
||||
</listitem>
|
||||
</step>
|
||||
|
||||
<listitem>
|
||||
<step>
|
||||
<para>If you have only a small set of HDFS client configurations, add them to
|
||||
<filename>hbase-site.xml</filename>.</para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
|
||||
<para>An example of such an HDFS client configuration is
|
||||
<varname>dfs.replication</varname>. If for example, you want to run with a replication
|
||||
factor of 5, HBase will create files with the default of 3 unless you do the above to
|
||||
make the configuration available to HBase.</para>
|
||||
</section>
|
||||
</section>
|
||||
</step>
|
||||
</stepalternatives>
|
||||
</step>
|
||||
</procedure>
|
||||
<para>An example of such an HDFS client configuration is <varname>dfs.replication</varname>.
|
||||
If, for example, you want to run with a replication factor of 5, HBase will create files with
|
||||
the default of 3 unless you do the above to make the configuration available to
|
||||
HBase.</para>
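  <para>As a minimal sketch using the last of the methods listed above, the override could be added
    directly to <filename>hbase-site.xml</filename>:</para>
  <programlisting><![CDATA[<property>
  <name>dfs.replication</name>
  <value>5</value>
</property>]]></programlisting>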
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section
|
||||
xml:id="confirm">
|
||||
|
@ -871,7 +897,7 @@ stopping hbase...............</screen>
|
|||
of many machines. If you are running a distributed operation, be sure to wait until HBase
|
||||
has shut down completely before stopping the Hadoop daemons.</para>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<!-- run modes -->
|
||||
|
||||
|
||||
|
|
|
@ -40,46 +40,51 @@
|
|||
|
||||
<section
|
||||
xml:id="quickstart">
|
||||
<title>Quick Start</title>
|
||||
<title>Quick Start - Standalone HBase</title>
|
||||
|
||||
<para>This guide describes setup of a standalone HBase instance. It will run against the local
|
||||
filesystem. In later sections we will take you through how to run HBase on Apache Hadoop's
|
||||
HDFS, a distributed filesystem. This section shows you how to create a table in HBase,
|
||||
inserting rows into your new HBase table via the HBase <command>shell</command>, and then
|
||||
cleaning up and shutting down your standalone, local filesystem-based HBase instance. The
|
||||
below exercise should take no more than ten minutes (not including download time). </para>
|
||||
<note
|
||||
<para>This guide describes setup of a standalone HBase instance running against the local
|
||||
filesystem. This is not an appropriate configuration for a production instance of HBase, but
|
||||
will allow you to experiment with HBase. This section shows you how to create a table in
|
||||
HBase using the <command>hbase shell</command> CLI, insert rows into the table, perform put
|
||||
and scan operations against the table, enable or disable the table, and start and stop HBase.
|
||||
Apart from downloading HBase, this procedure should take less than 10 minutes.</para>
|
||||
<warning
|
||||
xml:id="local.fs.durability">
|
||||
<title>Local Filesystem and Durability</title>
|
||||
<para>Using HBase with a LocalFileSystem does not currently guarantee durability. The HDFS
|
||||
local filesystem implementation will lose edits if files are not properly closed -- which is
|
||||
very likely to happen when experimenting with a new download. You need to run HBase on HDFS
|
||||
to ensure all writes are preserved. Running against the local filesystem though will get you
|
||||
off the ground quickly and get you familiar with how the general system works, so let's run
|
||||
with it for now. See <link
|
||||
<para><emphasis>The below advice is for HBase 0.98.2 and earlier releases only. This is fixed
|
||||
in HBase 0.98.3 and beyond. See <link
|
||||
xlink:href="https://issues.apache.org/jira/browse/HBASE-11272">HBASE-11272</link> and
|
||||
<link
|
||||
xlink:href="https://issues.apache.org/jira/browse/HBASE-11218">HBASE-11218</link>.</emphasis></para>
|
||||
<para>Using HBase with a local filesystem does not guarantee durability. The HDFS
|
||||
local filesystem implementation will lose edits if files are not properly closed. This is
|
||||
very likely to happen when you are experimenting with new software, starting and stopping
|
||||
the daemons often and not always cleanly. You need to run HBase on HDFS
|
||||
to ensure all writes are preserved. Running against the local filesystem is intended as a
|
||||
shortcut to get you familiar with how the general system works, as the very first phase of
|
||||
evaluation. See <link
|
||||
xlink:href="https://issues.apache.org/jira/browse/HBASE-3696" /> and its associated issues
|
||||
for more details.</para>
|
||||
</note>
|
||||
for more details about the issues of running on the local filesystem.</para>
|
||||
</warning>
|
||||
<note
|
||||
xml:id="loopback.ip.getting.started">
|
||||
<title>Loopback IP</title>
|
||||
<para><emphasis>The below advice is for hbase-0.94.x and older versions only. We believe this
|
||||
fixed in hbase-0.96.0 and beyond (let us know if we have it wrong).</emphasis> There
|
||||
should be no need of the below modification to <filename>/etc/hosts</filename> in later
|
||||
versions of HBase.</para>
|
||||
<title>Loopback IP - HBase 0.94.x and earlier</title>
|
||||
<para><emphasis>The below advice is for hbase-0.94.x and older versions only. This is fixed in
|
||||
hbase-0.96.0 and beyond.</emphasis></para>
|
||||
|
||||
<para>HBase expects the loopback IP address to be 127.0.0.1. Ubuntu and some other
|
||||
distributions, for example, will default to 127.0.1.1 and this will cause problems for you <footnote>
|
||||
<para>See <link
|
||||
xlink:href="http://blog.devving.com/why-does-hbase-care-about-etchosts/">Why does
|
||||
HBase care about /etc/hosts?</link> for detail.</para>
|
||||
</footnote>. </para>
|
||||
<para><filename>/etc/hosts</filename> should look something like this:</para>
|
||||
<screen>
|
||||
<para>In HBase 0.94.x and earlier, HBase expected the loopback IP address to be 127.0.0.1. Ubuntu
and some other distributions default to 127.0.1.1 and this will cause problems for you. See <link
|
||||
xlink:href="http://blog.devving.com/why-does-hbase-care-about-etchosts/">Why does HBase
|
||||
care about /etc/hosts?</link> for detail.</para>
|
||||
<example>
|
||||
<title>Example /etc/hosts File for Ubuntu</title>
|
||||
<para>The following <filename>/etc/hosts</filename> file works correctly for HBase 0.94.x
|
||||
and earlier, on Ubuntu. Use this as a template if you run into trouble.</para>
|
||||
<screen>
|
||||
127.0.0.1 localhost
|
||||
127.0.0.1 ubuntu.ubuntu-domain ubuntu
|
||||
</screen>
|
||||
|
||||
</screen>
|
||||
</example>
|
||||
</note>
|
||||
|
||||
<section>
|
||||
|
@ -89,159 +94,611 @@
|
|||
</section>
|
||||
|
||||
<section>
|
||||
<title>Download and unpack the latest stable release.</title>
|
||||
<title>Get Started with HBase</title>
|
||||
|
||||
<para>Choose a download site from this list of <link
|
||||
<procedure>
|
||||
<title>Download, Configure, and Start HBase</title>
|
||||
<step>
|
||||
<para>Choose a download site from this list of <link
|
||||
xlink:href="http://www.apache.org/dyn/closer.cgi/hbase/">Apache Download Mirrors</link>.
|
||||
Click on the suggested top link. This will take you to a mirror of <emphasis>HBase
|
||||
Releases</emphasis>. Click on the folder named <filename>stable</filename> and then
|
||||
download the file that ends in <filename>.tar.gz</filename> to your local filesystem; e.g.
|
||||
<filename>hbase-0.94.2.tar.gz</filename>.</para>
|
||||
|
||||
<para>Decompress and untar your download and then change into the unpacked directory.</para>
|
||||
|
||||
<screen><![CDATA[$ tar xfz hbase-<?eval ${project.version}?>.tar.gz
|
||||
$ cd hbase-<?eval ${project.version}?>]]>
|
||||
</screen>
|
||||
|
||||
<para>At this point, you are ready to start HBase. But before starting it, edit
|
||||
<filename>conf/hbase-site.xml</filename>, the file you write your site-specific
|
||||
configurations into. Set <varname>hbase.rootdir</varname>, the directory HBase writes data
|
||||
to, and <varname>hbase.zookeeper.property.dataDir</varname>, the directory ZooKeeper writes
|
||||
its data to:</para>
|
||||
<programlisting><![CDATA[<?xml version="1.0"?>
|
||||
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
|
||||
download the binary file that ends in <filename>.tar.gz</filename> to your local filesystem. Be
|
||||
sure to choose the version that corresponds with the version of Hadoop you are likely to use
|
||||
later. In most cases, you should choose the file for Hadoop 2, which will be called something
|
||||
like <filename>hbase-0.98.3-hadoop2-bin.tar.gz</filename>. Do not download the file ending in
|
||||
<filename>src.tar.gz</filename> for now.</para>
|
||||
</step>
|
||||
<step>
|
||||
<para>Extract the downloaded file, and change to the newly-created directory.</para>
|
||||
<screen>
|
||||
$ tar xzvf hbase-<![CDATA[<?eval ${project.version}?>]]>-hadoop2-bin.tar.gz
|
||||
$ cd hbase-<![CDATA[<?eval ${project.version}?>]]>-hadoop2/
|
||||
</screen>
|
||||
</step>
|
||||
<step>
|
||||
<para>Edit <filename>conf/hbase-site.xml</filename>, which is the main HBase configuration
|
||||
file. At this time, you only need to specify the directory on the local filesystem where
|
||||
HBase and Zookeeper write data. By default, a new directory is created under /tmp. Many
|
||||
servers are configured to delete the contents of /tmp upon reboot, so you should store
|
||||
the data elsewhere. The following configuration will store HBase's data in the
|
||||
<filename>hbase</filename> directory, in the home directory of the user called
|
||||
<systemitem>testuser</systemitem>. Paste the <markup><property></markup> tags beneath the
|
||||
<markup><configuration></markup> tags, which should be empty in a new HBase install.</para>
|
||||
<example>
|
||||
<title>Example <filename>hbase-site.xml</filename> for Standalone HBase</title>
|
||||
<programlisting><![CDATA[
|
||||
<configuration>
|
||||
<property>
|
||||
<name>hbase.rootdir</name>
|
||||
<value>file:///DIRECTORY/hbase</value>
|
||||
<value>file:///home/testuser/hbase</value>
|
||||
</property>
|
||||
<property>
|
||||
<name>hbase.zookeeper.property.dataDir</name>
|
||||
<value>/DIRECTORY/zookeeper</value>
|
||||
<value>/home/testuser/zookeeper</value>
|
||||
</property>
|
||||
</configuration>]]></programlisting>
|
||||
<para> Replace <varname>DIRECTORY</varname> in the above with the path to the directory you
|
||||
would have HBase and ZooKeeper write their data. By default,
|
||||
<varname>hbase.rootdir</varname> is set to <filename>/tmp/hbase-${user.name}</filename>
|
||||
and similarly so for the default ZooKeeper data location which means you'll lose all your
|
||||
data whenever your server reboots unless you change it (Most operating systems clear
|
||||
<filename>/tmp</filename> on restart).</para>
|
||||
</section>
|
||||
</configuration>
|
||||
]]>
|
||||
</programlisting>
|
||||
</example>
|
||||
<para>You do not need to create the HBase data directory. HBase will do this for you. If
|
||||
you create the directory, HBase will attempt to do a migration, which is not what you
|
||||
want.</para>
|
||||
</step>
|
||||
<step xml:id="start_hbase">
|
||||
<para>The <filename>bin/start-hbase.sh</filename> script is provided as a convenient way
|
||||
to start HBase. Issue the command, and if all goes well, a message is logged to standard
|
||||
output showing that HBase started successfully. You can use the <command>jps</command>
|
||||
command to verify that you have one running process called <literal>HMaster</literal>
|
||||
and at least one called <literal>HRegionServer</literal>.</para>
|
||||
<note><para>Java needs to be installed and available. If you get an error indicating that
|
||||
Java is not installed, but it is on your system, perhaps in a non-standard location,
|
||||
edit the <filename>conf/hbase-env.sh</filename> file and modify the
|
||||
<envar>JAVA_HOME</envar> setting to point to the directory that contains
|
||||
<filename>bin/java</filename> on your system.</para></note>
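          <para>For example, the relevant line in <filename>conf/hbase-env.sh</filename> might end
            up looking like the following. The path shown is only an illustration; use the
            location of the JDK actually installed on your system:</para>
          <programlisting>export JAVA_HOME=/usr/lib/jvm/java-7-oracle</programlisting>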
|
||||
</step>
|
||||
</procedure>
|
||||
|
||||
<section
|
||||
xml:id="start_hbase">
|
||||
<title>Start HBase</title>
|
||||
|
||||
<para>Now start HBase:</para>
|
||||
<screen>$ ./bin/start-hbase.sh
|
||||
starting Master, logging to logs/hbase-user-master-example.org.out</screen>
|
||||
|
||||
<para>You should now have a running standalone HBase instance. In standalone mode, HBase runs
|
||||
all daemons in the one JVM; i.e. both the HBase and ZooKeeper daemons. HBase logs can be
|
||||
found in the <filename>logs</filename> subdirectory. Check them out especially if it seems
|
||||
HBase had trouble starting.</para>
|
||||
|
||||
<note>
|
||||
<title>Is <application>java</application> installed?</title>
|
||||
|
||||
<para>All of the above presumes a 1.6 version of Oracle <application>java</application> is
|
||||
installed on your machine and available on your path (See <xref
|
||||
linkend="java" />); i.e. when you type <application>java</application>, you see output
|
||||
that describes the options the java program takes (HBase requires java 6). If this is not
|
||||
the case, HBase will not start. Install java, edit <filename>conf/hbase-env.sh</filename>,
|
||||
uncommenting the <envar>JAVA_HOME</envar> line pointing it to your java install, then,
|
||||
retry the steps above.</para>
|
||||
</note>
|
||||
</section>
|
||||
|
||||
<section
|
||||
xml:id="shell_exercises">
|
||||
<title>Shell Exercises</title>
|
||||
|
||||
<para>Connect to your running HBase via the <command>shell</command>.</para>
|
||||
|
||||
<screen><![CDATA[$ ./bin/hbase shell
|
||||
HBase Shell; enter 'help<RETURN>' for list of supported commands.
|
||||
Type "exit<RETURN>" to leave the HBase Shell
|
||||
Version: 0.90.0, r1001068, Fri Sep 24 13:55:42 PDT 2010
|
||||
|
||||
hbase(main):001:0>]]> </screen>
|
||||
|
||||
<para>Type <command>help</command> and then <command><RETURN></command> to see a listing
|
||||
of shell commands and options. Browse at least the paragraphs at the end of the help
|
||||
emission for the gist of how variables and command arguments are entered into the HBase
|
||||
shell; in particular note how table names, rows, and columns, etc., must be quoted.</para>
|
||||
|
||||
<para>Create a table named <varname>test</varname> with a single column family named
|
||||
<varname>cf</varname>. Verify its creation by listing all tables and then insert some
|
||||
values.</para>
|
||||
|
||||
<screen><![CDATA[hbase(main):003:0> create 'test', 'cf'
|
||||
    <procedure xml:id="shell_exercises">
      <title>Use HBase For the First Time</title>
      <step>
        <title>Connect to HBase.</title>
        <para>Connect to your running instance of HBase using the <command>hbase shell</command>
          command, located in the <filename>bin/</filename> directory of your HBase install. In
          this example, some usage and version information that is printed when you start HBase
          Shell has been omitted. The HBase Shell prompt ends with a
          <literal>></literal> character.</para>
        <screen>
$ <userinput>./bin/hbase shell</userinput>
hbase(main):001:0>
        </screen>
      </step>
      <step>
        <title>Display HBase Shell Help Text.</title>
        <para>Type <literal>help</literal> and press Enter to display some basic usage
          information for HBase Shell, as well as several example commands. Notice that table
          names, rows, and columns must all be enclosed in quote characters.</para>
      </step>
      <step>
        <title>Create a table.</title>
        <para>Use the <code>create</code> command to create a new table. You must specify the
          table name and the ColumnFamily name.</para>
        <screen>
hbase> <userinput>create 'test', 'cf'</userinput>
0 row(s) in 1.2200 seconds
        </screen>
      </step>
      <step>
        <title>List Information About your Table.</title>
        <para>Use the <code>list</code> command to confirm that your table exists.</para>
        <screen>
hbase> <userinput>list 'test'</userinput>
TABLE
test
1 row(s) in 0.0350 seconds

=> ["test"]
        </screen>
      </step>
      <step>
        <title>Put data into your table.</title>
        <para>To put data into your table, use the <code>put</code> command.</para>
        <screen>
hbase> <userinput>put 'test', 'row1', 'cf:a', 'value1'</userinput>
0 row(s) in 0.1770 seconds

hbase> <userinput>put 'test', 'row2', 'cf:b', 'value2'</userinput>
0 row(s) in 0.0160 seconds

hbase> <userinput>put 'test', 'row3', 'cf:c', 'value3'</userinput>
0 row(s) in 0.0260 seconds
        </screen>
        <para>Here, we insert three values, one at a time. The first insert is at
          <literal>row1</literal>, column <literal>cf:a</literal>, with a value of
          <literal>value1</literal>. Columns in HBase are comprised of a column family prefix,
          <literal>cf</literal> in this example, followed by a colon and then a column qualifier
          suffix, <literal>a</literal> in this case.</para>
      </step>
      <step>
        <title>Scan the table for all data at once.</title>
        <para>One of the ways to get data from HBase is to scan. Use the <command>scan</command>
          command to scan the table for data. You can limit your scan, but for now, all data is
          fetched.</para>
        <screen>
hbase> <userinput>scan 'test'</userinput>
ROW                   COLUMN+CELL
 row1                 column=cf:a, timestamp=1403759475114, value=value1
 row2                 column=cf:b, timestamp=1403759492807, value=value2
 row3                 column=cf:c, timestamp=1403759503155, value=value3
3 row(s) in 0.0440 seconds
        </screen>
      </step>
      <step>
        <title>Get a single row of data.</title>
        <para>To get a single row of data at a time, use the <command>get</command> command.</para>
        <screen>
hbase> <userinput>get 'test', 'row1'</userinput>
COLUMN                CELL
 cf:a                 timestamp=1403759475114, value=value1
1 row(s) in 0.0230 seconds
        </screen>
      </step>
      <step>
        <title>Disable a table.</title>
        <para>If you want to delete a table or change its settings, as well as in some other
          situations, you need to disable the table first, using the <code>disable</code>
          command. You can re-enable it using the <code>enable</code> command.</para>
        <screen>
hbase> <userinput>disable 'test'</userinput>
0 row(s) in 1.6270 seconds

hbase> <userinput>enable 'test'</userinput>
0 row(s) in 0.4500 seconds
        </screen>
        <para>Disable the table again if you tested the <command>enable</command> command
          above:</para>
        <screen>
hbase> <userinput>disable 'test'</userinput>
0 row(s) in 1.6270 seconds
        </screen>
      </step>
      <step>
        <title>Drop the table.</title>
        <para>To drop (delete) a table, use the <code>drop</code> command.</para>
        <screen>
hbase> <userinput>drop 'test'</userinput>
0 row(s) in 0.2900 seconds
        </screen>
      </step>
      <step>
        <title>Exit the HBase Shell.</title>
        <para>To exit the HBase Shell and disconnect from your cluster, use the
          <command>quit</command> command. HBase is still running in the background.</para>
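        <para>For example:</para>
        <screen>
hbase> <userinput>quit</userinput>
$
        </screen>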
      </step>
    </procedure>

    <procedure xml:id="stopping">
      <title>Stop HBase</title>
      <step>
        <para>In the same way that the <filename>bin/start-hbase.sh</filename> script is provided
          to conveniently start all HBase daemons, the <filename>bin/stop-hbase.sh</filename>
          script stops them.</para>
        <screen>
$ <userinput>./bin/stop-hbase.sh</userinput>
stopping hbase....................
$
        </screen>
      </step>
      <step>
        <para>After issuing the command, it can take several minutes for the processes to shut
          down. Use the <command>jps</command> command to be sure that the HMaster and
          HRegionServer processes are shut down.</para>
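        <para>For example, once everything has shut down, <command>jps</command> should no longer
          list the HBase processes (the process ID shown is illustrative):</para>
        <screen>
$ <userinput>jps</userinput>
21098 Jps
        </screen>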
      </step>
    </procedure>
    <section xml:id="quickstart-pseudo">
      <title>Intermediate - Pseudo-Distributed Local Install</title>
      <para>After working your way through <xref linkend="quickstart" />, you can re-configure
        HBase to run in pseudo-distributed mode. Pseudo-distributed mode means that HBase still
        runs completely on a single host, but each HBase daemon (HMaster, HRegionServer, and
        ZooKeeper) runs as a separate process. By default, unless you configure the
        <code>hbase.rootdir</code> property as described in <xref linkend="quickstart" />, your
        data is still stored in <filename>/tmp/</filename>. In this walk-through, we store your
        data in HDFS instead, assuming you have HDFS available. You can skip the HDFS
        configuration to continue storing your data in the local filesystem.</para>
      <note>
        <title>Hadoop Configuration</title>
        <para>This procedure assumes that you have configured Hadoop and HDFS on your local system
          and/or a remote system, and that they are running and available. It also assumes you are
          using Hadoop 2. Currently, the documentation on the Hadoop website does not include a
          quick start for Hadoop 2, but the guide at <link
            xlink:href="http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide">http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide</link>
          is a good starting point.</para>
      </note>
      <procedure>
        <step>
          <title>Stop HBase if it is running.</title>
          <para>If you have just finished <xref linkend="quickstart" /> and HBase is still running,
            stop it. This procedure will create a totally new directory where HBase will store its
            data, so any databases you created before will be lost.</para>
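          <para>For example:</para>
          <screen>
$ <userinput>./bin/stop-hbase.sh</userinput>
stopping hbase....................
          </screen>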
        </step>
        <step>
          <title>Configure HBase.</title>
          <para>Edit the <filename>hbase-site.xml</filename> configuration. First, add the
            following property, which directs HBase to run in distributed mode, with one JVM
            instance per daemon.</para>
          <programlisting><![CDATA[
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
]]></programlisting>
          <para>Next, change the <code>hbase.rootdir</code> from the local filesystem to the
            address of your HDFS instance, using the <code>hdfs://</code> URI syntax. In this
            example, HDFS is running on the localhost at port 8020.</para>
          <programlisting><![CDATA[
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost:8020/hbase</value>
</property>
]]></programlisting>
          <para>You do not need to create the directory in HDFS. HBase will do this for you. If
            you create the directory, HBase will attempt to do a migration, which is not what you
            want.</para>
        </step>
        <step>
          <title>Start HBase.</title>
          <para>Use the <filename>bin/start-hbase.sh</filename> command to start HBase. If your
            system is configured correctly, the <command>jps</command> command should show the
            HMaster and HRegionServer processes running.</para>
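          <para>For example (the process IDs are illustrative; because HBase manages ZooKeeper for
            you in this walk-through, you should also see an <literal>HQuorumPeer</literal>
            process):</para>
          <screen>
$ <userinput>jps</userinput>
28772 Jps
28391 HMaster
28275 HQuorumPeer
28511 HRegionServer
          </screen>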
        </step>
        <step>
          <title>Check the HBase directory in HDFS.</title>
          <para>If everything worked correctly, HBase created its directory in HDFS. In the
            configuration above, it is stored in <filename>/hbase/</filename> on HDFS. You can use
            the <command>hadoop fs</command> command in Hadoop's <filename>bin/</filename>
            directory to list this directory.</para>
          <screen>
$ <userinput>./bin/hadoop fs -ls /hbase</userinput>
Found 7 items
drwxr-xr-x   - hbase users          0 2014-06-25 18:58 /hbase/.tmp
drwxr-xr-x   - hbase users          0 2014-06-25 21:49 /hbase/WALs
drwxr-xr-x   - hbase users          0 2014-06-25 18:48 /hbase/corrupt
drwxr-xr-x   - hbase users          0 2014-06-25 18:58 /hbase/data
-rw-r--r--   3 hbase users         42 2014-06-25 18:41 /hbase/hbase.id
-rw-r--r--   3 hbase users          7 2014-06-25 18:41 /hbase/hbase.version
drwxr-xr-x   - hbase users          0 2014-06-25 21:49 /hbase/oldWALs
          </screen>
        </step>
        <step>
          <title>Create a table and populate it with data.</title>
          <para>You can use the HBase Shell to create a table, populate it with data, scan and get
            values from it, using the same procedure as in <xref linkend="shell_exercises" />.</para>
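          <para>For example, the following abbreviated shell session (output omitted) runs the
            same commands against the HDFS-backed instance:</para>
          <screen>
hbase> <userinput>create 'test', 'cf'</userinput>
hbase> <userinput>put 'test', 'row1', 'cf:a', 'value1'</userinput>
hbase> <userinput>scan 'test'</userinput>
hbase> <userinput>get 'test', 'row1'</userinput>
          </screen>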
        </step>
        <step>
          <title>Start and stop a backup HBase Master (HMaster) server.</title>
          <note>
            <para>Running multiple HMaster instances on the same hardware does not make sense in a
              production environment, in the same way that running a pseudo-distributed cluster
              does not make sense for production. This step is offered for testing and learning
              purposes only.</para>
          </note>
          <para>The HMaster server controls the HBase cluster. You can start up to 9 backup
            HMaster servers, which makes 10 total HMasters, counting the primary. To start a
            backup HMaster, use the <command>local-master-backup.sh</command> command. For each
            backup master you want to start, add a parameter representing the port offset for that
            master. Each HMaster uses two ports (16000 and 16010 by default). The port offset is
            added to these ports, so using an offset of 2, the first backup HMaster would use
            ports 16002 and 16012. The following command starts 3 backup servers using ports
            16002/16012, 16003/16013, and 16005/16015.</para>
          <screen>
$ ./bin/local-master-backup.sh 2 3 5
          </screen>
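          <para>If the backup masters started successfully, <command>jps</command> lists one
            additional <literal>HMaster</literal> process per backup, alongside the primary (the
            process IDs shown here are illustrative):</para>
          <screen>
$ <userinput>jps</userinput>
30473 Jps
30321 HMaster
30367 HMaster
30412 HMaster
28391 HMaster
28275 HQuorumPeer
28511 HRegionServer
          </screen>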
          <para>To kill a backup master without killing the entire cluster, you need to find its
            process ID (PID). The PID is stored in a file with a name like
            <filename>/tmp/hbase-<replaceable>USER</replaceable>-<replaceable>X</replaceable>-master.pid</filename>.
            The only contents of the file are the PID. You can use the <command>kill -9</command>
            command to kill that PID. The following command will kill the master with port offset
            1, but leave the cluster running:</para>
          <screen>
$ cat /tmp/hbase-testuser-1-master.pid |xargs kill -9
          </screen>
        </step>
        <step>
          <title>Start and stop additional RegionServers.</title>
          <para>The HRegionServer manages the data in its StoreFiles as directed by the HMaster.
            Generally, one HRegionServer runs per node in the cluster. Running multiple
            HRegionServers on the same system can be useful for testing in pseudo-distributed
            mode. The <command>local-regionservers.sh</command> command allows you to run multiple
            RegionServers. It works in a similar way to the
            <command>local-master-backup.sh</command> command, in that each parameter you provide
            represents the port offset for an instance. Each RegionServer requires two ports, and
            the default ports are 16200 and 16300. You can run 99 additional RegionServers, or 100
            total, on a server. The following command starts four additional RegionServers,
            running on sequential ports starting at 16202/16302.</para>
          <screen>
$ ./bin/local-regionservers.sh start 2 3 4 5
          </screen>
          <para>To stop a RegionServer manually, use the <command>local-regionservers.sh</command>
            command with the <literal>stop</literal> parameter and the offset of the server to
            stop.</para>
          <screen>$ ./bin/local-regionservers.sh stop 3</screen>
        </step>
        <step>
          <title>Stop HBase.</title>
          <para>You can stop HBase the same way as in the <xref linkend="quickstart" /> procedure,
            using the <filename>bin/stop-hbase.sh</filename> command.</para>
        </step>
      </procedure>
    </section>
    <section xml:id="quickstart-fully-distributed">
      <title>Advanced - Fully Distributed</title>
      <para>In reality, you need a fully-distributed configuration to fully test HBase and to use
        it in real-world scenarios. In a distributed configuration, the cluster contains multiple
        nodes, each of which runs one or more HBase daemons. These include primary and backup
        Master instances, multiple ZooKeeper nodes, and multiple RegionServer nodes.</para>
      <para>This advanced quickstart adds two more nodes to your cluster. The architecture will be
        as follows:</para>
      <table>
        <title>Distributed Cluster Demo Architecture</title>
        <tgroup cols="4">
          <thead>
            <row>
              <entry>Node Name</entry>
              <entry>Master</entry>
              <entry>ZooKeeper</entry>
              <entry>RegionServer</entry>
            </row>
          </thead>
          <tbody>
            <row>
              <entry>node-a.example.com</entry>
              <entry>yes</entry>
              <entry>yes</entry>
              <entry>no</entry>
            </row>
            <row>
              <entry>node-b.example.com</entry>
              <entry>backup</entry>
              <entry>yes</entry>
              <entry>yes</entry>
            </row>
            <row>
              <entry>node-c.example.com</entry>
              <entry>no</entry>
              <entry>yes</entry>
              <entry>yes</entry>
            </row>
          </tbody>
        </tgroup>
      </table>
      <para>This quickstart assumes that each node is a virtual machine and that they are all on
        the same network. It builds upon the previous quickstart, <xref
          linkend="quickstart-pseudo" />, assuming that the system you configured in that
        procedure is now <code>node-a</code>. Stop HBase on <code>node-a</code> before
        continuing.</para>
      <note>
        <para>Be sure that all the nodes have full access to communicate, and that no firewall
          rules are in place which could prevent them from talking to each other. If you see any
          errors like <literal>no route to host</literal>, check your firewall.</para>
      </note>
      <procedure xml:id="passwordless.ssh.quickstart">
        <title>Configure Password-Less SSH Access</title>
        <para><code>node-a</code> needs to be able to log into <code>node-b</code> and
          <code>node-c</code> (and to itself) in order to start the daemons. The easiest way to
          accomplish this is to use the same username on all hosts, and configure password-less
          SSH login from <code>node-a</code> to each of the others.</para>
        <step>
          <title>On <code>node-a</code>, generate a key pair.</title>
          <para>While logged in as the user who will run HBase, generate an SSH key pair, using
            the following command:</para>
          <screen>$ ssh-keygen -t rsa</screen>
          <para>If the command succeeds, the location of the key pair is printed to standard
            output. The default name of the public key is <filename>id_rsa.pub</filename>.</para>
        </step>
        <step>
          <title>Create the directory that will hold the shared keys on the other nodes.</title>
          <para>On <code>node-b</code> and <code>node-c</code>, log in as the HBase user and
            create a <filename>.ssh/</filename> directory in the user's home directory, if it does
            not already exist. If it already exists, be aware that it may already contain other
            keys.</para>
        </step>
        <step>
          <title>Copy the public key to the other nodes.</title>
          <para>Securely copy the public key from <code>node-a</code> to each of the nodes, using
            <command>scp</command> or some other secure means. On each of the other nodes, create
            a new file called <filename>.ssh/authorized_keys</filename> <emphasis>if it does not
            already exist</emphasis>, and append the contents of the
            <filename>id_rsa.pub</filename> file to the end of it. Note that you also need to do
            this for <code>node-a</code> itself.</para>
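          <para>For example, assuming the same user runs HBase on every node (the username
            <literal>hbuser</literal> and the destination path are illustrative), you might copy
            the key with <command>scp</command>:</para>
          <screen>$ scp ~/.ssh/id_rsa.pub hbuser@node-b.example.com:/home/hbuser/</screen>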
          <screen>$ cat id_rsa.pub >> ~/.ssh/authorized_keys</screen>
        </step>
        <step>
          <title>Test password-less login.</title>
          <para>If you performed the procedure correctly, when you SSH from <code>node-a</code> to
            either of the other nodes using the same username, you should not be prompted for a
            password.</para>
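          <para>For example, from <code>node-a</code> (the login banner shown is
            illustrative):</para>
          <screen>
$ <userinput>ssh node-b.example.com</userinput>
Last login: Wed Jun 25 21:49:12 2014 from node-a.example.com
          </screen>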
        </step>
        <step>
          <para>Since <code>node-b</code> will run a backup Master, repeat the procedure above,
            substituting <code>node-b</code> everywhere you see <code>node-a</code>. Be sure not
            to overwrite your existing <filename>.ssh/authorized_keys</filename> files, but
            concatenate the new key onto the existing file using the <code>>></code> operator
            rather than the <code>></code> operator.</para>
        </step>
      </procedure>
      <procedure>
        <title>Prepare <code>node-a</code></title>
        <para><code>node-a</code> will run your primary master and ZooKeeper processes, but no
          RegionServers.</para>
        <step>
          <title>Stop the RegionServer from starting on <code>node-a</code>.</title>
          <para>Edit <filename>conf/regionservers</filename> and remove the line which contains
            <literal>localhost</literal>. Add lines with the hostnames or IP addresses for
            <code>node-b</code> and <code>node-c</code>. Even if you did want to run a
            RegionServer on <code>node-a</code>, you should refer to it by the hostname the other
            servers would use to communicate with it. In this case, that would be
            <literal>node-a.example.com</literal>. This enables you to distribute the
            configuration to each node of your cluster without hostname conflicts. Save the
            file.</para>
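          <para>After editing, <filename>conf/regionservers</filename> should contain only the two
            RegionServer hosts:</para>
          <programlisting>node-b.example.com
node-c.example.com</programlisting>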
        </step>
        <step>
          <title>Configure HBase to use <code>node-b</code> as a backup master.</title>
          <para>Create a new file in <filename>conf/</filename> called
            <filename>backup-masters</filename>, and add a new line to it with the hostname for
            <code>node-b</code>. In this demonstration, the hostname is
            <literal>node-b.example.com</literal>.</para>
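          <para>For example, one way to create the file, run from the HBase installation
            directory:</para>
          <screen>$ echo "node-b.example.com" > conf/backup-masters</screen>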
        </step>
        <step>
          <title>Configure ZooKeeper.</title>
          <para>In reality, you should carefully consider your ZooKeeper configuration. You can
            find out more about configuring ZooKeeper in <xref linkend="zookeeper" />. This
            configuration will direct HBase to start and manage a ZooKeeper instance on each node
            of the cluster.</para>
          <para>On <code>node-a</code>, edit <filename>conf/hbase-site.xml</filename> and add the
            following properties.</para>
          <programlisting><![CDATA[
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>node-a.example.com,node-b.example.com,node-c.example.com</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/usr/local/zookeeper</value>
</property>
]]></programlisting>
        </step>
        <step>
          <para>Everywhere in your configuration that you have referred to <code>node-a</code> as
            <literal>localhost</literal>, change the reference to point to the hostname that the
            other nodes will use to refer to <code>node-a</code>. In these examples, the hostname
            is <literal>node-a.example.com</literal>.</para>
        </step>
      </procedure>
      <procedure>
        <title>Prepare <code>node-b</code> and <code>node-c</code></title>
        <para><code>node-b</code> will run a backup master server, a ZooKeeper instance, and a
          RegionServer, and <code>node-c</code> will run a ZooKeeper instance and a
          RegionServer.</para>
        <step>
          <title>Download and unpack HBase.</title>
          <para>Download and unpack HBase on <code>node-b</code> and <code>node-c</code>, just as
            you did for the standalone and pseudo-distributed quickstarts.</para>
        </step>
        <step>
          <title>Copy the configuration files from <code>node-a</code> to <code>node-b</code> and
            <code>node-c</code>.</title>
          <para>Each node of your cluster needs to have the same configuration information. Copy
            the contents of the <filename>conf/</filename> directory to the
            <filename>conf/</filename> directory on <code>node-b</code> and
            <code>node-c</code>.</para>
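          <para>For example, assuming HBase is unpacked to
            <filename>/home/hbuser/hbase-0.98.3-hadoop2</filename> on every node (an illustrative
            path, taken from the example output later in this section), you could copy the
            configuration from <code>node-a</code> with <command>scp</command>:</para>
          <screen>
$ scp conf/* hbuser@node-b.example.com:/home/hbuser/hbase-0.98.3-hadoop2/conf/
$ scp conf/* hbuser@node-c.example.com:/home/hbuser/hbase-0.98.3-hadoop2/conf/
          </screen>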
        </step>
      </procedure>

      <procedure>
        <title>Start and Test Your Cluster</title>
        <step>
          <title>Be sure HBase is not running on any node.</title>
          <para>If you forgot to stop HBase from previous testing, you will have errors. Check to
            see whether HBase is running on any of your nodes by using the <command>jps</command>
            command. Look for the processes <literal>HMaster</literal>,
            <literal>HRegionServer</literal>, and <literal>HQuorumPeer</literal>. If they exist,
            kill them.</para>
        </step>
        <step>
          <title>Start the cluster.</title>
          <para>On <code>node-a</code>, issue the <command>start-hbase.sh</command> command. Your
            output will be similar to that below.</para>
          <screen>
$ <userinput>bin/start-hbase.sh</userinput>
node-c.example.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-c.example.com.out
node-a.example.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-a.example.com.out
node-b.example.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-b.example.com.out
starting master, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-master-node-a.example.com.out
node-c.example.com: starting regionserver, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-regionserver-node-c.example.com.out
node-b.example.com: starting regionserver, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-regionserver-node-b.example.com.out
node-b.example.com: starting master, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-master-nodeb.example.com.out
          </screen>
          <para>ZooKeeper starts first, followed by the master, then the RegionServers, and
            finally the backup masters.</para>
        </step>
        <step>
          <title>Verify that the processes are running.</title>
          <para>On each node of the cluster, run the <command>jps</command> command and verify
            that the correct processes are running on each server. You may see additional Java
            processes running on your servers as well, if they are used for other purposes.</para>
          <example>
            <title><code>node-a</code> <command>jps</command> Output</title>
            <screen>
$ <userinput>jps</userinput>
20355 Jps
20071 HQuorumPeer
20137 HMaster
            </screen>
          </example>
          <example>
            <title><code>node-b</code> <command>jps</command> Output</title>
            <screen>
$ <userinput>jps</userinput>
15930 HRegionServer
16194 Jps
15838 HQuorumPeer
16010 HMaster
            </screen>
          </example>
          <example>
            <title><code>node-c</code> <command>jps</command> Output</title>
            <screen>
$ <userinput>jps</userinput>
13901 Jps
13639 HQuorumPeer
13737 HRegionServer
            </screen>
          </example>
          <note>
            <title>ZooKeeper Process Name</title>
            <para>The <code>HQuorumPeer</code> process is a ZooKeeper instance which is controlled
              and started by HBase. If you use ZooKeeper this way, it is limited to one instance
              per cluster node and is appropriate for testing only. If ZooKeeper is run outside of
              HBase, the process is called <code>QuorumPeer</code>. For more about ZooKeeper
              configuration, including using an external ZooKeeper instance with HBase, see <xref
                linkend="zookeeper" />.</para>
          </note>
        </step>
        <step>
          <title>Browse to the Web UI.</title>
          <note>
            <title>Web UI Port Changes</title>
            <para>In HBase newer than 0.98.x, the HTTP ports used by the HBase Web UI changed from
              60010 for the Master and 60030 for each RegionServer to 16010 for the Master and
              16030 for the RegionServer.</para>
          </note>
          <para>If everything is set up correctly, you should be able to use a web browser to
            connect to the UI for the Master at
            <literal>http://node-a.example.com:60010/</literal>, or for the secondary master at
            <literal>http://node-b.example.com:60010/</literal>. If you can connect via
            <code>localhost</code> but not from another host, check your firewall rules. You can
            see the web UI for each of the RegionServers at port 60030 of their IP addresses, or
            by clicking their links in the web UI for the Master.</para>
        </step>
        <step>
          <title>Test what happens when nodes or services disappear.</title>
          <para>With a three-node cluster like you have configured, things will not be very
            resilient. Still, you can test what happens when the primary Master or a RegionServer
            disappears, by killing the processes and watching the logs.</para>
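          <para>For example, to simulate a RegionServer failure, you might log into
            <code>node-c</code> and kill its <literal>HRegionServer</literal> process, using the
            PID reported by <command>jps</command> (the PID below matches the illustrative output
            above):</para>
          <screen>
$ <userinput>jps</userinput>
13901 Jps
13639 HQuorumPeer
13737 HRegionServer
$ <userinput>kill -9 13737</userinput>
          </screen>
          <para>Then watch the Master log on <code>node-a</code> to see how the cluster
            responds.</para>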
        </step>
      </procedure>
    </section>
    <section>
      <title>Where to go next</title>
      <para>The next chapter, <xref linkend="configuration" />, gives more information about the
        different HBase run modes, system requirements for running HBase, and critical
        configuration areas for setting up a distributed HBase cluster.</para>
    </section>
  </section>
</chapter>