HADOOP-6821. Document changes to memory monitoring. Contributed by Hemanth Yamijala.
git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/trunk@954647 13f79535-47bb-0310-9956-ffa450edef68
commit c6b8095b02 (parent e89ac4b077)
@@ -958,6 +958,9 @@ Release 0.21.0 - Unreleased
     HADOOP-6668. Apply audience and stability annotations to classes in
     common. (tomwhite)

+    HADOOP-6821. Document changes to memory monitoring. (Hemanth Yamijala
+    via tomwhite)
+
     OPTIMIZATIONS

     HADOOP-5595. NameNode does not need to run a replicator to choose a
@@ -739,147 +739,251 @@
           </li>
         </ul>
       </section>
       <section>
-        <title> Memory management</title>
-        <p>Users/admins can also specify the maximum virtual memory
-        of the launched child-task, and any sub-process it launches
-        recursively, using <code>mapred.{map|reduce}.child.ulimit</code>. Note
-        that the value set here is a per process limit.
-        The value for <code>mapred.{map|reduce}.child.ulimit</code> should be
-        specified in kilo bytes (KB). And also the value must be greater than
-        or equal to the -Xmx passed to JavaVM, else the VM might not start.
+        <title>Configuring Memory Parameters for MapReduce Jobs</title>
+        <p>
+          As MapReduce jobs can use varying amounts of memory, Hadoop provides
+          various configuration options to users and administrators for
+          managing memory effectively. Some of these options are job specific
+          and can be used by users. While setting up a cluster, administrators
+          can configure appropriate default values for these options so that
+          users' jobs run out of the box. Other options are cluster specific
+          and can be used by administrators to enforce limits and prevent
+          misconfigured or memory-intensive jobs from causing undesired side
+          effects on the cluster.
+        </p>
+        <p>
+          The values configured should take into account the hardware
+          resources of the cluster, such as the amount of physical and virtual
+          memory available for tasks, the number of slots configured on the
+          slaves and the requirements of other processes running on the
+          slaves. If the right values are not set, it is likely that jobs will
+          start failing with memory-related errors or, in the worst case, even
+          affect other tasks or the slaves themselves.
         </p>

-        <p>Note: <code>mapred.{map|reduce}.child.java.opts</code> are used only for
-        configuring the launched child tasks from task tracker. Configuring
-        the memory options for daemons is documented in
-        <a href="cluster_setup.html#Configuring+the+Environment+of+the+Hadoop+Daemons">
-        cluster_setup.html </a></p>
-        <p>The memory available to some parts of the framework is also
-        configurable. In map and reduce tasks, performance may be influenced
-        by adjusting parameters influencing the concurrency of operations and
-        the frequency with which data will hit disk. Monitoring the filesystem
-        counters for a job- particularly relative to byte counts from the map
-        and into the reduce- is invaluable to the tuning of these
-        parameters.</p>
+        <section>
+          <title>Monitoring Task Memory Usage</title>
+          <p>
+            Before describing the memory options, it is useful to look at a
+            feature provided by Hadoop to monitor memory usage of the
+            MapReduce tasks it runs. The basic objective of this feature is to
+            prevent MapReduce tasks from consuming memory beyond a limit that
+            would result in their affecting other processes running on the
+            slave, including other tasks and daemons like the DataNode or
+            TaskTracker.
+          </p>
+          <p>
+            <em>Note:</em> For the time being, this feature is available only
+            on the Linux platform.
+          </p>
+          <p>
+            Hadoop allows monitoring of both the virtual and the physical
+            memory usage of tasks. The two can be monitored independently of
+            each other, and therefore the corresponding options can also be
+            configured independently. It has been found in some environments,
+            particularly related to streaming, that the virtual memory
+            recorded for tasks is high because of libraries loaded by the
+            programs used to run the tasks. However, this memory is largely
+            unused and does not affect the slave's own memory. In such cases,
+            monitoring based on physical memory can provide a more accurate
+            picture of memory usage.
+          </p>
+          <p>
+            This feature assumes that there is a limit on the amount of
+            virtual or physical memory on the slaves that can be used by the
+            running MapReduce tasks. The rest of the memory is assumed to be
+            required for the system and other processes. Since some jobs may
+            require more memory for their tasks than others, Hadoop allows
+            jobs to specify the maximum amount of memory they expect to use.
+            Then, by using resource-aware scheduling and monitoring, Hadoop
+            tries to ensure that at any time only as many tasks are running on
+            the slaves as can meet the dual constraints of an individual job's
+            memory requirements and the total amount of memory available for
+            all MapReduce tasks.
+          </p>
+          <p>
+            The TaskTracker monitors tasks at regular intervals. Each time, it
+            operates in two steps:
+          </p>
+          <ul>
+            <li>
+              In the first step, it checks that a job's task and any child
+              processes it launches are not cumulatively using more virtual or
+              physical memory than specified. If monitoring is enabled for
+              both virtual and physical memory, then virtual memory usage is
+              checked first, followed by physical memory usage. Any task that
+              is found to use more memory is killed along with any child
+              processes it might have launched, and the task status is marked
+              <em>failed</em>. Repeated failures such as this will terminate
+              the job.
+            </li>
+            <li>
+              In the next step, it checks that the cumulative virtual and
+              physical memory used by all running tasks and their child
+              processes does not exceed the total virtual and physical memory
+              limit, respectively. Again, the virtual memory limit is checked
+              first, followed by the physical memory limit. In this case, it
+              kills enough tasks, along with any child processes they might
+              have launched, to bring the cumulative memory usage under the
+              limit. If the virtual memory limit is exceeded, the tasks chosen
+              for killing are the ones that have made the least progress. If
+              the physical memory limit is exceeded, the tasks chosen for
+              killing are the ones that have used the most physical memory.
+              Also, the status of these tasks is marked as <em>killed</em>,
+              and hence repeated occurrences of this will not result in a job
+              failure.
+            </li>
+          </ul>
+          <p>
+            In either case, the task's diagnostic message will indicate the
+            reason why the task was terminated.
+          </p>
+          <p>
+            Resource-aware scheduling can ensure that tasks are scheduled on a
+            slave only if their memory requirements can be satisfied by the
+            slave. The Capacity Scheduler, for example, takes virtual memory
+            requirements into account while scheduling tasks, as described in
+            the section on
+            <a href="ext:capacity-scheduler/MemoryBasedTaskScheduling">
+            memory based scheduling</a>.
+          </p>
+          <p>
+            Memory monitoring is enabled when certain configuration variables
+            are defined with non-zero values, as described below.
+          </p>
         </section>

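The two-step check described in the section above can be summarized in a short, purely illustrative sketch. The class and field names below are hypothetical and do not mirror the actual TaskTracker code; the sketch assumes per-task usage figures have already been collected and shows only the virtual-memory path, the physical-memory path being analogous.

    import java.util.Comparator;
    import java.util.List;

    // Illustrative sketch only: hypothetical names, not the real TaskTracker code.
    class TaskMemoryCheckSketch {

      static class Task {
        long vmemUsed;    // virtual memory used by the task and its children, in bytes
        long vmemLimit;   // per-task limit requested by the job, in bytes
        float progress;   // fraction of the task's work completed
      }

      // Step 1: any task over its own limit is killed and marked failed,
      // so repeated violations count towards the job's failure limit.
      static void checkPerTaskLimits(List<Task> tasks) {
        for (Task t : tasks) {
          if (t.vmemUsed > t.vmemLimit) {
            killAndMarkFailed(t);
          }
        }
      }

      // Step 2: if the tasks together exceed the node-wide limit, kill the
      // least-progressed tasks (marked killed, not failed) until usage fits.
      static void checkCumulativeLimit(List<Task> tasks, long nodeLimit) {
        long total = 0;
        for (Task t : tasks) {
          total += t.vmemUsed;
        }
        tasks.sort(Comparator.comparingDouble(t -> t.progress));
        for (Task t : tasks) {
          if (total <= nodeLimit) {
            break;
          }
          total -= t.vmemUsed;
          killAndMarkKilled(t);
        }
      }

      static void killAndMarkFailed(Task t) { /* kill task and children, mark failed */ }
      static void killAndMarkKilled(Task t) { /* kill task and children, mark killed */ }
    }
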
       <section>
-        <title> Memory monitoring</title>
-        <p>A <code>TaskTracker</code> (TT) can be configured to monitor memory
-        usage of tasks it spawns, so that badly-behaved jobs do not bring
-        down a machine due to excess memory consumption. With monitoring
-        enabled, every task is assigned a task-limit for virtual memory (VMEM).
-        In addition, every node is assigned a node-limit for VMEM usage.
-        A TT ensures that a task is killed if it, and its descendants,
-        use VMEM over the task's per-task limit. It also ensures that one or
-        more tasks are killed if the sum total of VMEM usage by all tasks,
-        and their descendants, crosses the node-limit.</p>
-        <p>Users can, optionally, specify the VMEM task-limit per job. If no
-        such limit is provided, a default limit is used. A node-limit can be
-        set per node.</p>
-        <p>Currently the memory monitoring and management is only supported
-        on the Linux platform.</p>
-        <p>To enable monitoring for a TT, the following parameters all need
-        to be set:</p>
-        <table>
-        <tr><th>Name</th><th>Type</th><th>Description</th></tr>
-        <tr><td>mapred.tasktracker.vmem.reserved</td><td>long</td>
-        <td>A number, in bytes, that represents an offset. The total VMEM on
-        the machine, minus this offset, is the VMEM node-limit for all
-        tasks, and their descendants, spawned by the TT.</td></tr>
-        <tr><td>mapred.task.default.maxvmem</td><td>long</td>
-        <td>A number, in bytes, that represents the default VMEM task-limit
-        associated with a task. Unless overridden by a job's setting,
-        this number defines the VMEM task-limit.</td></tr>
-        <tr><td>mapred.task.limit.maxvmem</td><td>long</td>
-        <td>A number, in bytes, that represents the upper VMEM task-limit
-        associated with a task. Users, when specifying a VMEM task-limit
-        for their tasks, should not specify a limit which exceeds this
-        amount.</td></tr>
-        </table>
-        <p>In addition, the following parameters can also be configured.</p>
-        <table>
-        <tr><th>Name</th><th>Type</th><th>Description</th></tr>
-        <tr><td>mapreduce.tasktracker.taskmemorymanager.monitoringinterval</td>
-        <td>long</td>
-        <td>The time interval, in milliseconds, between which the TT
-        checks for any memory violation. The default value is 5000 msec
-        (5 seconds).</td></tr>
-        </table>
-        <p>Here is how the memory monitoring works for a TT.</p>
-        <ol>
-        <li>If one or more of the configuration parameters described
-        above are missing or -1 is specified, memory monitoring is
-        disabled for the TT.</li>
-        <li>In addition, monitoring is disabled if
-        <code>mapred.task.default.maxvmem</code> is greater than
-        <code>mapred.task.limit.maxvmem</code>.</li>
-        <li>If a TT receives a task whose task-limit is set by the user
-        to a value larger than <code>mapred.task.limit.maxvmem</code>, it
-        logs a warning but executes the task.</li>
-        <li>Periodically, the TT checks the following:
-        <ul>
-        <li>If any task's current VMEM usage is greater than that task's
-        VMEM task-limit, the task is killed and the reason for killing the
-        task is logged in the task diagnostics. Such a task is considered
-        failed, i.e., the killing counts towards the task's failure count.</li>
-        <li>If the sum total of VMEM used by all tasks and descendants is
-        greater than the node-limit, the TT kills enough tasks, in the
-        order of least progress made, till the overall VMEM usage falls
-        below the node-limit. Such killed tasks are not considered failed
-        and their killing does not count towards the tasks' failure counts.</li>
-        </ul></li>
-        </ol>
-        <p>Schedulers can choose to ease the monitoring pressure on the TT by
-        preventing too many tasks from running on a node and by scheduling
-        tasks only if the TT has enough VMEM free. In addition, Schedulers may
-        choose to consider the physical memory (RAM) available on the node
-        as well. To enable Scheduler support, TTs report their memory settings
-        to the JobTracker in every heartbeat. Before getting into details,
-        consider the following additional memory-related parameters that can
-        be configured to enable better scheduling:</p>
-        <table>
-        <tr><th>Name</th><th>Type</th><th>Description</th></tr>
-        <tr><td>mapred.tasktracker.pmem.reserved</td><td>int</td>
-        <td>A number, in bytes, that represents an offset. The total
-        physical memory (RAM) on the machine, minus this offset, is the
-        recommended RAM node-limit. The RAM node-limit is a hint to a
-        Scheduler to schedule only so many tasks that the sum total of
-        their RAM requirements does not exceed this limit.
-        RAM usage is not monitored by a TT.</td></tr>
-        </table>
-        <p>A TT reports the following memory-related numbers in every
-        heartbeat:</p>
-        <ul>
-        <li>The total VMEM available on the node.</li>
-        <li>The value of <code>mapred.tasktracker.vmem.reserved</code>,
-        if set.</li>
-        <li>The total RAM available on the node.</li>
-        <li>The value of <code>mapred.tasktracker.pmem.reserved</code>,
-        if set.</li>
-        </ul>
+        <section>
+          <title>Job Specific Options</title>
+          <p>
+            Memory-related options that can be configured individually per job
+            are described in detail in the section on
+            <a href="ext:mapred-tutorial/ConfiguringMemoryRequirements">
+            Configuring Memory Requirements For A Job</a> in the MapReduce
+            tutorial. While setting up the cluster, the Hadoop defaults for
+            these options can be reviewed and changed to better suit the job
+            profiles expected to run on the cluster, as well as the hardware
+            configuration.
+          </p>
+          <p>
+            As with any other configuration option in Hadoop, if
+            administrators want to prevent users from overriding these options
+            in the jobs they submit, the values can be marked as
+            <em>final</em> in the cluster configuration.
+          </p>
         </section>

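For illustration, a user job could request task memory through the options referenced above before submission. The snippet below is a minimal sketch using the generic Configuration API; the 1024 MB and 2048 MB values are arbitrary examples, and the cluster-level maximums described in the next section still apply.

    import org.apache.hadoop.conf.Configuration;

    // Minimal sketch: request 1 GB per map task and 2 GB per reduce task.
    // The values are examples only; jobs asking for more than the cluster's
    // mapreduce.jobtracker.max{map|reduce}memory.mb limits will be rejected.
    public class JobMemoryRequest {
      public static Configuration newJobConf() {
        Configuration conf = new Configuration();
        conf.setLong("mapreduce.map.memory.mb", 1024L);
        conf.setLong("mapreduce.reduce.memory.mb", 2048L);
        return conf;
      }
    }
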
+        <section>
+          <title>Cluster Specific Options</title>
+          <p>
+            This section describes the memory-related options that are used by
+            the JobTracker and TaskTrackers and cannot be changed by jobs. The
+            values set for these options should be the same on all the slave
+            nodes in a cluster.
+          </p>
+          <ul>
+            <li>
+              <code>mapreduce.cluster.{map|reduce}memory.mb</code>: These
+              options define the default amount of virtual memory that should
+              be allocated for MapReduce tasks running in the cluster. They
+              typically match the default values set for the options
+              <code>mapreduce.{map|reduce}.memory.mb</code>. They help in
+              calculating the total amount of virtual memory available for
+              MapReduce tasks on a slave, using the following equation:<br/>
+              <em>Total virtual memory for all MapReduce tasks =
+              (mapreduce.cluster.mapmemory.mb *
+              mapreduce.tasktracker.map.tasks.maximum) +
+              (mapreduce.cluster.reducememory.mb *
+              mapreduce.tasktracker.reduce.tasks.maximum)</em><br/>
+              Typically, reduce tasks require more memory than map tasks, so a
+              higher value is recommended for
+              <em>mapreduce.cluster.reducememory.mb</em>. The value is
+              specified in MB. To set a value of 2GB for reduce tasks, set
+              <em>mapreduce.cluster.reducememory.mb</em> to 2048.
+            </li>
+            <li>
+              <code>mapreduce.jobtracker.max{map|reduce}memory.mb</code>:
+              These options define the maximum amount of virtual memory that
+              can be requested by jobs using the parameters
+              <code>mapreduce.{map|reduce}.memory.mb</code>. The system will
+              reject any job that is submitted requesting more memory than
+              these limits. Typically, the values for these options should be
+              set to satisfy the following constraint:<br/>
+              <em>mapreduce.jobtracker.maxmapmemory.mb =
+              mapreduce.cluster.mapmemory.mb *
+              mapreduce.tasktracker.map.tasks.maximum<br/>
+              mapreduce.jobtracker.maxreducememory.mb =
+              mapreduce.cluster.reducememory.mb *
+              mapreduce.tasktracker.reduce.tasks.maximum</em><br/>
+              The value is specified in MB. If
+              <code>mapreduce.cluster.reducememory.mb</code> is set to 2GB and
+              there are 2 reduce slots configured on the slaves, the value for
+              <code>mapreduce.jobtracker.maxreducememory.mb</code> should be
+              set to 4096.
+            </li>
+            <li>
+              <code>mapreduce.tasktracker.reserved.physicalmemory.mb</code>:
+              This option defines the amount of physical memory that is
+              reserved for system and daemon processes. Using this, the amount
+              of physical memory available for MapReduce tasks is calculated
+              using the following equation:<br/>
+              <em>Total physical memory for all MapReduce tasks =
+              Total physical memory available on the system -
+              mapreduce.tasktracker.reserved.physicalmemory.mb</em><br/>
+              The value is specified in MB. To set this value to 2GB, specify
+              the value as 2048.
+            </li>
+            <li>
+              <code>mapreduce.tasktracker.taskmemorymanager.monitoringinterval</code>:
+              This option defines the time the TaskTracker waits between two
+              cycles of memory monitoring. The value is specified in
+              milliseconds.
+            </li>
+          </ul>
+          <p>
+            <em>Note:</em> The virtual memory monitoring function is only
+            enabled if the variables
+            <code>mapreduce.cluster.{map|reduce}memory.mb</code> and
+            <code>mapreduce.jobtracker.max{map|reduce}memory.mb</code> are set
+            to values greater than zero. Likewise, the physical memory
+            monitoring function is only enabled if the variable
+            <code>mapreduce.tasktracker.reserved.physicalmemory.mb</code> is
+            set to a value greater than zero.
+          </p>
+        </section>
+      </section>

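To make the arithmetic in the list above concrete, the sketch below works through the stated equations for a hypothetical slave with 2 map slots, 2 reduce slots and 8 GB of RAM; all numbers are illustrative, not recommended defaults.

    // Worked example of the equations above for a hypothetical slave.
    // All values are in MB and purely illustrative.
    public class ClusterMemoryMath {
      public static void main(String[] args) {
        long clusterMapMemoryMb = 1024;     // mapreduce.cluster.mapmemory.mb
        long clusterReduceMemoryMb = 2048;  // mapreduce.cluster.reducememory.mb
        int mapSlots = 2;                   // mapreduce.tasktracker.map.tasks.maximum
        int reduceSlots = 2;                // mapreduce.tasktracker.reduce.tasks.maximum

        // Total virtual memory for all MapReduce tasks on the slave.
        long totalVmemForTasks = clusterMapMemoryMb * mapSlots
            + clusterReduceMemoryMb * reduceSlots;                    // 6144 MB

        // JobTracker limits that satisfy the constraint above.
        long maxMapMemoryMb = clusterMapMemoryMb * mapSlots;          // 2048 MB
        long maxReduceMemoryMb = clusterReduceMemoryMb * reduceSlots; // 4096 MB

        // Physical memory left for tasks on an 8 GB slave that reserves 2 GB
        // via mapreduce.tasktracker.reserved.physicalmemory.mb.
        long totalPhysicalMb = 8192;
        long reservedPhysicalMb = 2048;
        long physicalForTasks = totalPhysicalMb - reservedPhysicalMb; // 6144 MB

        System.out.println("Virtual memory for tasks: " + totalVmemForTasks + " MB");
        System.out.println("mapreduce.jobtracker.maxmapmemory.mb: " + maxMapMemoryMb);
        System.out.println("mapreduce.jobtracker.maxreducememory.mb: " + maxReduceMemoryMb);
        System.out.println("Physical memory for tasks: " + physicalForTasks + " MB");
      }
    }
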
       <section>
         <title>Task Controllers</title>
         <p>Task controllers are classes in the Hadoop Map/Reduce
@@ -72,9 +72,12 @@ See http://forrest.apache.org/docs/linking.html for more info.
     <mapred-default href="http://hadoop.apache.org/mapreduce/docs/current/mapred-default.html" />

     <mapred-queues href="http://hadoop.apache.org/mapreduce/docs/current/mapred_queues.xml" />
-    <capacity-scheduler href="http://hadoop.apache.org/mapreduce/docs/current/capacity_scheduler.html" />
+    <capacity-scheduler href="http://hadoop.apache.org/mapreduce/docs/current/capacity_scheduler.html">
+      <MemoryBasedTaskScheduling href="#Scheduling+Tasks+Considering+Memory+Requirements" />
+    </capacity-scheduler>
     <mapred-tutorial href="http://hadoop.apache.org/mapreduce/docs/current/mapred_tutorial.html" >
       <JobAuthorization href="#Job+Authorization" />
+      <ConfiguringMemoryRequirements href="#Configuring+Memory+Requirements+For+A+Job" />
     </mapred-tutorial>
     <streaming href="http://hadoop.apache.org/mapreduce/docs/current/streaming.html" />
     <distcp href="http://hadoop.apache.org/mapreduce/docs/current/distcp.html" />