HADOOP-6616. Improve documentation for rack awareness. Contributed by Adam Faris.

git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/trunk@1411359 13f79535-47bb-0310-9956-ffa450edef68
2012-11-19 19:27:00 +00:00 · 2012-11-19 19:27:00 +00:00 · e9f01ac284
parent c271f3cded
commit e9f01ac284
2 changed files with 135 additions and 17 deletions
--- a/hadoop-common-project/hadoop-common/CHANGES.txt
+++ b/hadoop-common-project/hadoop-common/CHANGES.txt
@ -132,6 +132,8 @@ Trunk (Unreleased)
    HADOOP-9004. Allow security unit tests to use external KDC. (Stephen Chu
    via suresh)
    HADOOP-6616. Improve documentation for rack awareness. (Adam Faris via jghoman)
  BUG FIXES
    HADOOP-8177. MBeans shouldn't try to register when it fails to create MBeanName.
--- a/hadoop-common-project/hadoop-common/src/main/docs/src/documentation/content/xdocs/cluster_setup.xml
+++ b/hadoop-common-project/hadoop-common/src/main/docs/src/documentation/content/xdocs/cluster_setup.xml
@ -1292,23 +1292,139 @@
    <section>
      <title>Hadoop Rack Awareness</title>
-      <p>The HDFS and the Map/Reduce components are rack-aware.</p>
+      <p>
-      <p>The <code>NameNode</code> and the <code>JobTracker</code> obtains the
+         Both HDFS and Map/Reduce components are rack-aware.  HDFS block placement will use rack 
-      <code>rack id</code> of the slaves in the cluster by invoking an API 
+         awareness for fault tolerance by placing one block replica on a different rack.  This provides 
-      <a href="ext:api/org/apache/hadoop/net/dnstoswitchmapping/resolve
+         data availability in the event of a network switch failure within the cluster.  The jobtracker uses rack
-      ">resolve</a> in an administrator configured
+         awareness to reduce network transfers of HDFS data blocks by attempting to schedule tasks on datanodes with a local
-      module. The API resolves the slave's DNS name (also IP address) to a 
+         copy of needed HDFS blocks.  If the tasks cannot be scheduled on the datanodes
-      rack id. What module to use can be configured using the configuration
+         containing the needed HDFS blocks, then the tasks will be scheduled on the same rack to reduce network transfers if possible.
-      item <code>net.topology.node.switch.mapping.impl</code>. The default 
+      </p>
-      implementation of the same runs a script/command configured using 
+      <p>The NameNode and the JobTracker obtain the rack id of the cluster slaves by invoking either 
-      <code>net.topology.script.file.name</code>. If topology.script.file.name is
+         an external script or java class as specified by configuration files.  Using either the 
-      not set, the rack id <code>/default-rack</code> is returned for any 
+         java class or external script for topology, output must adhere to the java 
-      passed IP address. The additional configuration in the Map/Reduce
+         <a href="ext:api/org/apache/hadoop/net/dnstoswitchmapping/resolve">DNSToSwitchMapping</a> 
-      part is <code>mapred.cache.task.levels</code> which determines the number
+         interface.  The interface expects a one-to-one correspondence to be maintained 
-      of levels (in the network topology) of caches. So, for example, if it is
+         and the topology information in the format of '/myrack/myhost', where '/' is the topology 
-      the default value of 2, two levels of caches will be constructed - 
+         delimiter, 'myrack' is the rack identifier, and 'myhost' is the individual host.  Assuming 
-      one for hosts (host -> task mapping) and another for racks 
+         a single /24 subnet per rack, one could use the format of '/192.168.100.0/192.168.100.5' as a 
-      (rack -> task mapping).
+         unique rack-host topology mapping.
      </p>
      <p>
         To use the java class for topology mapping, the class name is specified by the 
         <code>'topology.node.switch.mapping.impl'</code> parameter in the configuration file. 
         An example, NetworkTopology.java, is included with the hadoop distribution and can be customized 
         by the hadoop administrator.  If not included with your distribution, NetworkTopology.java can also be found in the Hadoop 
         <a href="http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/net/NetworkTopology.java?view=markup">
         subversion tree</a>.  Using a java class instead of an external script has a slight performance benefit in 
         that it doesn't need to fork an external process when a new slave node registers itself with the jobtracker or namenode.  
         As this class is only used during slave node registration, the performance benefit is limited.  
      </p>
      <p>
         If implementing an external script, it will be specified with the
         <code>topology.script.file.name</code> parameter in the configuration files.  Unlike the java 
         class, the external topology script is not included with the Hadoop distribution and is provided by the 
         administrator.  Hadoop will send multiple IP addresses to ARGV when forking the topology script.  The  
         number of IP addresses sent to the topology script is controlled with <code>net.topology.script.number.args</code>
         and defaults to 100. If <code>net.topology.script.number.args</code> was changed to 1, a topology script would 
         get forked for each IP submitted by datanodes and/or tasktrackers.  Below are example topology scripts.
      </p>
      <section>
      <title>Python example</title>
      <source>
      <code>
      #!/usr/bin/python
      # this script makes assumptions about the physical environment.
      #  1) each rack is its own layer 3 network with a /24 subnet, which could be typical where each rack has its own
      #     switch with uplinks to a central core router.
      #     
      #             +-----------+
      #             |core router|
      #             +-----------+
      #            /             \
      #   +-----------+        +-----------+
      #   |rack switch|        |rack switch|
      #   +-----------+        +-----------+
      #   | data node |        | data node |
      #   +-----------+        +-----------+
      #   | data node |        | data node |
      #   +-----------+        +-----------+
      #
      # 2) topology script gets list of IP's as input, calculates network address, and prints '/network_address/ip'.
      import netaddr
      import sys             
      sys.argv.pop(0)                                                  # discard name of topology script from argv list as we just want IP addresses
      netmask = '255.255.255.0'                                        # set netmask to what's being used in your environment.  The example uses a /24
      for ip in sys.argv:                                              # loop over list of datanode IP's
          address = '{0}/{1}'.format(ip, netmask)                      # format address string so it looks like 'ip/netmask' to make netaddr work
          try:
              network_address = netaddr.IPNetwork(address).network     # calculate and print network address
              print "/{0}".format(network_address)                     
          except:
              print "/rack-unknown"                                    # print catch-all value if unable to calculate network address
      </code>
      </source>
      </section>
      <section>
      <title>Bash  example</title>
      <source>
      <code>
      #!/bin/bash
      # Here's a bash example to show just how simple these scripts can be
      # Assuming we have flat network with everything on a single switch, we can fake a rack topology. 
      # This could occur in a lab environment where we have limited nodes,like 2-8 physical machines on a unmanaged switch. 
      # This may also apply to multiple virtual machines running on the same physical hardware.  
      # The number of machines isn't important, but that we are trying to fake a network topology when there isn't one. 
      #
      #       +----------+    +--------+
      #       |jobtracker|    |datanode| 
      #       +----------+    +--------+
      #              \        /
      #  +--------+  +--------+  +--------+
      #  |datanode|--| switch |--|datanode|
      #  +--------+  +--------+  +--------+
      #              /        \
      #       +--------+    +--------+
      #       |datanode|    |namenode| 
      #       +--------+    +--------+
      #
      # With this network topology, we are treating each host as a rack.  This is being done by taking the last octet 
      # in the datanode's IP and prepending it with the word '/rack-'.  The advantage for doing this is so HDFS
      # can create its 'off-rack' block copy.
      # 1) 'echo $@' will echo all ARGV values to xargs.  
      # 2) 'xargs' will enforce that we print a single argv value per line
      # 3) 'awk' will split fields on dots and append the last field to the string '/rack-'. If awk 
      #    fails to split on four dots, it will still print '/rack-' last field value
      echo $@ | xargs -n 1 | awk -F '.' '{print "/rack-"$NF}'
      </code>
      </source>
      </section>
      <p>
         If <code>topology.script.file.name</code> or <code>topology.node.switch.mapping.impl</code> is 
         not set, the rack id '/default-rack' is returned for any passed IP address.  
         While this behavior appears desirable, it can cause issues with HDFS block replication as 
         default behavior is to write one replicated block off rack and is unable to do so as there is 
         only a single rack named '/default-rack'.
      </p>
      <p>
         An additional configuration setting is <code>mapred.cache.task.levels</code> which determines 
         the number of levels (in the network topology) of caches. So, for example, if it is the 
         default value of 2, two levels of caches will be constructed - one for hosts 
         (host -> task mapping) and another for racks (rack -> task mapping). Giving us our one-to-one 
          mapping of '/myrack/myhost'
      </p>
    </section>