HADOOP-6616. Improve documentation for rack awareness. Contributed by Adam Faris.

git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/trunk@1411359 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
Jakob Homan 2012-11-19 19:27:00 +00:00
parent c271f3cded
commit e9f01ac284
2 changed files with 135 additions and 17 deletions

View File

@ -132,6 +132,8 @@ Trunk (Unreleased)
HADOOP-9004. Allow security unit tests to use external KDC. (Stephen Chu
via suresh)
HADOOP-6616. Improve documentation for rack awareness. (Adam Faris via jghoman)
BUG FIXES
HADOOP-8177. MBeans shouldn't try to register when it fails to create MBeanName.

View File

@ -1292,23 +1292,139 @@
<section>
<title>Hadoop Rack Awareness</title>
<p>The HDFS and the Map/Reduce components are rack-aware.</p>
<p>The <code>NameNode</code> and the <code>JobTracker</code> obtains the
<code>rack id</code> of the slaves in the cluster by invoking an API
<a href="ext:api/org/apache/hadoop/net/dnstoswitchmapping/resolve
">resolve</a> in an administrator configured
module. The API resolves the slave's DNS name (also IP address) to a
rack id. What module to use can be configured using the configuration
item <code>net.topology.node.switch.mapping.impl</code>. The default
implementation of the same runs a script/command configured using
<code>net.topology.script.file.name</code>. If topology.script.file.name is
not set, the rack id <code>/default-rack</code> is returned for any
passed IP address. The additional configuration in the Map/Reduce
part is <code>mapred.cache.task.levels</code> which determines the number
of levels (in the network topology) of caches. So, for example, if it is
the default value of 2, two levels of caches will be constructed -
one for hosts (host -> task mapping) and another for racks
(rack -> task mapping).
<p>
Both HDFS and Map/Reduce components are rack-aware. HDFS block placement will use rack
awareness for fault tolerance by placing one block replica on a different rack. This provides
data availability in the event of a network switch failure within the cluster. The jobtracker uses rack
awareness to reduce network transfers of HDFS data blocks by attempting to schedule tasks on datanodes with a local
copy of needed HDFS blocks. If the tasks cannot be scheduled on the datanodes
containing the needed HDFS blocks, then the tasks will be scheduled on the same rack to reduce network transfers if possible.
</p>
<p>The NameNode and the JobTracker obtain the rack id of the cluster slaves by invoking either
an external script or java class as specified by configuration files. Using either the
java class or external script for topology, output must adhere to the java
<a href="ext:api/org/apache/hadoop/net/dnstoswitchmapping/resolve">DNSToSwitchMapping</a>
interface. The interface expects a one-to-one correspondence to be maintained
and the topology information in the format of '/myrack/myhost', where '/' is the topology
delimiter, 'myrack' is the rack identifier, and 'myhost' is the individual host. Assuming
a single /24 subnet per rack, one could use the format of '/192.168.100.0/192.168.100.5' as a
unique rack-host topology mapping.
</p>
<p>
To use the java class for topology mapping, the class name is specified by the
<code>'topology.node.switch.mapping.impl'</code> parameter in the configuration file.
An example, NetworkTopology.java, is included with the hadoop distribution and can be customized
by the hadoop administrator. If not included with your distribution, NetworkTopology.java can also be found in the Hadoop
<a href="http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/NetworkTopology.java?view=markup">
subversion tree</a>. Using a java class instead of an external script has a slight performance benefit in
that it doesn't need to fork an external process when a new slave node registers itself with the jobtracker or namenode.
As this class is only used during slave node registration, the performance benefit is limited.
</p>
<p>
If implementing an external script, it will be specified with the
<code>topology.script.file.name</code> parameter in the configuration files. Unlike the java
class, the external topology script is not included with the Hadoop distribution and is provided by the
administrator. Hadoop will send multiple IP addresses to ARGV when forking the topology script. The
number of IP addresses sent to the topology script is controlled with <code>net.topology.script.number.args</code>
and defaults to 100. If <code>net.topology.script.number.args</code> was changed to 1, a topology script would
get forked for each IP submitted by datanodes and/or tasktrackers. Below are example topology scripts.
</p>
<section>
<title>Python example</title>
<source>
<code>
#!/usr/bin/python
# this script makes assumptions about the physical environment.
# 1) each rack is its own layer 3 network with a /24 subnet, which could be typical where each rack has its own
# switch with uplinks to a central core router.
#
# +-----------+
# |core router|
# +-----------+
# / \
# +-----------+ +-----------+
# |rack switch| |rack switch|
# +-----------+ +-----------+
# | data node | | data node |
# +-----------+ +-----------+
# | data node | | data node |
# +-----------+ +-----------+
#
# 2) topology script gets list of IP's as input, calculates network address, and prints '/network_address/ip'.
import netaddr
import sys
sys.argv.pop(0) # discard name of topology script from argv list as we just want IP addresses
netmask = '255.255.255.0' # set netmask to what's being used in your environment. The example uses a /24
for ip in sys.argv: # loop over list of datanode IP's
address = '{0}/{1}'.format(ip, netmask) # format address string so it looks like 'ip/netmask' to make netaddr work
try:
network_address = netaddr.IPNetwork(address).network # calculate and print network address
print "/{0}".format(network_address)
except:
print "/rack-unknown" # print catch-all value if unable to calculate network address
</code>
</source>
</section>
<section>
<title>Bash example</title>
<source>
<code>
#!/bin/bash
# Here's a bash example to show just how simple these scripts can be
# Assuming we have flat network with everything on a single switch, we can fake a rack topology.
# This could occur in a lab environment where we have limited nodes,like 2-8 physical machines on a unmanaged switch.
# This may also apply to multiple virtual machines running on the same physical hardware.
# The number of machines isn't important, but that we are trying to fake a network topology when there isn't one.
#
# +----------+ +--------+
# |jobtracker| |datanode|
# +----------+ +--------+
# \ /
# +--------+ +--------+ +--------+
# |datanode|--| switch |--|datanode|
# +--------+ +--------+ +--------+
# / \
# +--------+ +--------+
# |datanode| |namenode|
# +--------+ +--------+
#
# With this network topology, we are treating each host as a rack. This is being done by taking the last octet
# in the datanode's IP and prepending it with the word '/rack-'. The advantage for doing this is so HDFS
# can create its 'off-rack' block copy.
# 1) 'echo $@' will echo all ARGV values to xargs.
# 2) 'xargs' will enforce that we print a single argv value per line
# 3) 'awk' will split fields on dots and append the last field to the string '/rack-'. If awk
# fails to split on four dots, it will still print '/rack-' last field value
echo $@ | xargs -n 1 | awk -F '.' '{print "/rack-"$NF}'
</code>
</source>
</section>
<p>
If <code>topology.script.file.name</code> or <code>topology.node.switch.mapping.impl</code> is
not set, the rack id '/default-rack' is returned for any passed IP address.
While this behavior appears desirable, it can cause issues with HDFS block replication as
default behavior is to write one replicated block off rack and is unable to do so as there is
only a single rack named '/default-rack'.
</p>
<p>
An additional configuration setting is <code>mapred.cache.task.levels</code> which determines
the number of levels (in the network topology) of caches. So, for example, if it is the
default value of 2, two levels of caches will be constructed - one for hosts
(host -> task mapping) and another for racks (rack -> task mapping). Giving us our one-to-one
mapping of '/myrack/myhost'
</p>
</section>