HADOOP-6616. Improve documentation for rack awareness. Contributed by Adam Faris.
git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/trunk@1411359 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
parent
c271f3cded
commit
e9f01ac284
|
@ -132,6 +132,8 @@ Trunk (Unreleased)
|
||||||
HADOOP-9004. Allow security unit tests to use external KDC. (Stephen Chu
|
HADOOP-9004. Allow security unit tests to use external KDC. (Stephen Chu
|
||||||
via suresh)
|
via suresh)
|
||||||
|
|
||||||
|
HADOOP-6616. Improve documentation for rack awareness. (Adam Faris via jghoman)
|
||||||
|
|
||||||
BUG FIXES
|
BUG FIXES
|
||||||
|
|
||||||
HADOOP-8177. MBeans shouldn't try to register when it fails to create MBeanName.
|
HADOOP-8177. MBeans shouldn't try to register when it fails to create MBeanName.
|
||||||
|
|
|
@ -1292,23 +1292,139 @@
|
||||||
|
|
||||||
<section>
|
<section>
|
||||||
<title>Hadoop Rack Awareness</title>
|
<title>Hadoop Rack Awareness</title>
|
||||||
<p>The HDFS and the Map/Reduce components are rack-aware.</p>
|
<p>
|
||||||
<p>The <code>NameNode</code> and the <code>JobTracker</code> obtains the
|
Both HDFS and Map/Reduce components are rack-aware. HDFS block placement will use rack
|
||||||
<code>rack id</code> of the slaves in the cluster by invoking an API
|
awareness for fault tolerance by placing one block replica on a different rack. This provides
|
||||||
<a href="ext:api/org/apache/hadoop/net/dnstoswitchmapping/resolve
|
data availability in the event of a network switch failure within the cluster. The jobtracker uses rack
|
||||||
">resolve</a> in an administrator configured
|
awareness to reduce network transfers of HDFS data blocks by attempting to schedule tasks on datanodes with a local
|
||||||
module. The API resolves the slave's DNS name (also IP address) to a
|
copy of needed HDFS blocks. If the tasks cannot be scheduled on the datanodes
|
||||||
rack id. What module to use can be configured using the configuration
|
containing the needed HDFS blocks, then the tasks will be scheduled on the same rack to reduce network transfers if possible.
|
||||||
item <code>net.topology.node.switch.mapping.impl</code>. The default
|
</p>
|
||||||
implementation of the same runs a script/command configured using
|
<p>The NameNode and the JobTracker obtain the rack id of the cluster slaves by invoking either
|
||||||
<code>net.topology.script.file.name</code>. If topology.script.file.name is
|
an external script or java class as specified by configuration files. Using either the
|
||||||
not set, the rack id <code>/default-rack</code> is returned for any
|
java class or external script for topology, output must adhere to the java
|
||||||
passed IP address. The additional configuration in the Map/Reduce
|
<a href="ext:api/org/apache/hadoop/net/dnstoswitchmapping/resolve">DNSToSwitchMapping</a>
|
||||||
part is <code>mapred.cache.task.levels</code> which determines the number
|
interface. The interface expects a one-to-one correspondence to be maintained
|
||||||
of levels (in the network topology) of caches. So, for example, if it is
|
and the topology information in the format of '/myrack/myhost', where '/' is the topology
|
||||||
the default value of 2, two levels of caches will be constructed -
|
delimiter, 'myrack' is the rack identifier, and 'myhost' is the individual host. Assuming
|
||||||
one for hosts (host -> task mapping) and another for racks
|
a single /24 subnet per rack, one could use the format of '/192.168.100.0/192.168.100.5' as a
|
||||||
(rack -> task mapping).
|
unique rack-host topology mapping.
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
To use the java class for topology mapping, the class name is specified by the
|
||||||
|
<code>'topology.node.switch.mapping.impl'</code> parameter in the configuration file.
|
||||||
|
An example, NetworkTopology.java, is included with the hadoop distribution and can be customized
|
||||||
|
by the hadoop administrator. If not included with your distribution, NetworkTopology.java can also be found in the Hadoop
|
||||||
|
<a href="http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/NetworkTopology.java?view=markup">
|
||||||
|
subversion tree</a>. Using a java class instead of an external script has a slight performance benefit in
|
||||||
|
that it doesn't need to fork an external process when a new slave node registers itself with the jobtracker or namenode.
|
||||||
|
As this class is only used during slave node registration, the performance benefit is limited.
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
If implementing an external script, it will be specified with the
|
||||||
|
<code>topology.script.file.name</code> parameter in the configuration files. Unlike the java
|
||||||
|
class, the external topology script is not included with the Hadoop distribution and is provided by the
|
||||||
|
administrator. Hadoop will send multiple IP addresses to ARGV when forking the topology script. The
|
||||||
|
number of IP addresses sent to the topology script is controlled with <code>net.topology.script.number.args</code>
|
||||||
|
and defaults to 100. If <code>net.topology.script.number.args</code> was changed to 1, a topology script would
|
||||||
|
get forked for each IP submitted by datanodes and/or tasktrackers. Below are example topology scripts.
|
||||||
|
</p>
|
||||||
|
<section>
|
||||||
|
<title>Python example</title>
|
||||||
|
<source>
|
||||||
|
<code>
|
||||||
|
#!/usr/bin/python
|
||||||
|
|
||||||
|
# this script makes assumptions about the physical environment.
|
||||||
|
# 1) each rack is its own layer 3 network with a /24 subnet, which could be typical where each rack has its own
|
||||||
|
# switch with uplinks to a central core router.
|
||||||
|
#
|
||||||
|
# +-----------+
|
||||||
|
# |core router|
|
||||||
|
# +-----------+
|
||||||
|
# / \
|
||||||
|
# +-----------+ +-----------+
|
||||||
|
# |rack switch| |rack switch|
|
||||||
|
# +-----------+ +-----------+
|
||||||
|
# | data node | | data node |
|
||||||
|
# +-----------+ +-----------+
|
||||||
|
# | data node | | data node |
|
||||||
|
# +-----------+ +-----------+
|
||||||
|
#
|
||||||
|
# 2) topology script gets list of IP's as input, calculates network address, and prints '/network_address/ip'.
|
||||||
|
|
||||||
|
import netaddr
|
||||||
|
import sys
|
||||||
|
sys.argv.pop(0) # discard name of topology script from argv list as we just want IP addresses
|
||||||
|
|
||||||
|
netmask = '255.255.255.0' # set netmask to what's being used in your environment. The example uses a /24
|
||||||
|
|
||||||
|
for ip in sys.argv: # loop over list of datanode IP's
|
||||||
|
address = '{0}/{1}'.format(ip, netmask) # format address string so it looks like 'ip/netmask' to make netaddr work
|
||||||
|
try:
|
||||||
|
network_address = netaddr.IPNetwork(address).network # calculate and print network address
|
||||||
|
print "/{0}".format(network_address)
|
||||||
|
except:
|
||||||
|
print "/rack-unknown" # print catch-all value if unable to calculate network address
|
||||||
|
|
||||||
|
</code>
|
||||||
|
</source>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>Bash example</title>
|
||||||
|
<source>
|
||||||
|
<code>
|
||||||
|
#!/bin/bash
|
||||||
|
# Here's a bash example to show just how simple these scripts can be
|
||||||
|
|
||||||
|
# Assuming we have flat network with everything on a single switch, we can fake a rack topology.
|
||||||
|
# This could occur in a lab environment where we have limited nodes,like 2-8 physical machines on a unmanaged switch.
|
||||||
|
# This may also apply to multiple virtual machines running on the same physical hardware.
|
||||||
|
# The number of machines isn't important, but that we are trying to fake a network topology when there isn't one.
|
||||||
|
#
|
||||||
|
# +----------+ +--------+
|
||||||
|
# |jobtracker| |datanode|
|
||||||
|
# +----------+ +--------+
|
||||||
|
# \ /
|
||||||
|
# +--------+ +--------+ +--------+
|
||||||
|
# |datanode|--| switch |--|datanode|
|
||||||
|
# +--------+ +--------+ +--------+
|
||||||
|
# / \
|
||||||
|
# +--------+ +--------+
|
||||||
|
# |datanode| |namenode|
|
||||||
|
# +--------+ +--------+
|
||||||
|
#
|
||||||
|
# With this network topology, we are treating each host as a rack. This is being done by taking the last octet
|
||||||
|
# in the datanode's IP and prepending it with the word '/rack-'. The advantage for doing this is so HDFS
|
||||||
|
# can create its 'off-rack' block copy.
|
||||||
|
|
||||||
|
# 1) 'echo $@' will echo all ARGV values to xargs.
|
||||||
|
# 2) 'xargs' will enforce that we print a single argv value per line
|
||||||
|
# 3) 'awk' will split fields on dots and append the last field to the string '/rack-'. If awk
|
||||||
|
# fails to split on four dots, it will still print '/rack-' last field value
|
||||||
|
|
||||||
|
echo $@ | xargs -n 1 | awk -F '.' '{print "/rack-"$NF}'
|
||||||
|
|
||||||
|
|
||||||
|
</code>
|
||||||
|
</source>
|
||||||
|
</section>
|
||||||
|
|
||||||
|
|
||||||
|
<p>
|
||||||
|
If <code>topology.script.file.name</code> or <code>topology.node.switch.mapping.impl</code> is
|
||||||
|
not set, the rack id '/default-rack' is returned for any passed IP address.
|
||||||
|
While this behavior appears desirable, it can cause issues with HDFS block replication as
|
||||||
|
default behavior is to write one replicated block off rack and is unable to do so as there is
|
||||||
|
only a single rack named '/default-rack'.
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
An additional configuration setting is <code>mapred.cache.task.levels</code> which determines
|
||||||
|
the number of levels (in the network topology) of caches. So, for example, if it is the
|
||||||
|
default value of 2, two levels of caches will be constructed - one for hosts
|
||||||
|
(host -> task mapping) and another for racks (rack -> task mapping). Giving us our one-to-one
|
||||||
|
mapping of '/myrack/myhost'
|
||||||
</p>
|
</p>
|
||||||
</section>
|
</section>
|
||||||
|
|
||||||
|
|
Loading…
Reference in New Issue