HADOOP-6821. Document changes to memory monitoring. Contributed by Hemanth Yamijala.
git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/trunk@954647 13f79535-47bb-0310-9956-ffa450edef68
commit c6b8095b02 (parent e89ac4b077)
@@ -958,6 +958,9 @@ Release 0.21.0 - Unreleased
     HADOOP-6668. Apply audience and stability annotations to classes in
     common. (tomwhite)

+    HADOOP-6821. Document changes to memory monitoring. (Hemanth Yamijala
+    via tomwhite)
+
     OPTIMIZATIONS

     HADOOP-5595. NameNode does not need to run a replicator to choose a
@@ -739,147 +739,251 @@
           </li>
         </ul>
       </section>
       <section>
-        <title> Memory management</title>
-        <p>Users/admins can also specify the maximum virtual memory
-        of the launched child-task, and any sub-process it launches
-        recursively, using <code>mapred.{map|reduce}.child.ulimit</code>. Note
-        that the value set here is a per process limit.
-        The value for <code>mapred.{map|reduce}.child.ulimit</code> should be
-        specified in kilo bytes (KB). And also the value must be greater than
-        or equal to the -Xmx passed to JavaVM, else the VM might not start.
+        <title>Configuring Memory Parameters for MapReduce Jobs</title>
+        <p>
+          As MapReduce jobs can use varying amounts of memory, Hadoop provides
+          various configuration options to users and administrators for
+          managing memory effectively. Some of these options are job specific
+          and can be used by users. While setting up a cluster, administrators
+          can configure appropriate default values for these options so that
+          users' jobs run out of the box. Other options are cluster specific
+          and can be used by administrators to enforce limits and prevent
+          misconfigured or memory-intensive jobs from causing undesired side
+          effects on the cluster.
+        </p>
+        <p>
+          The values configured should take into account the hardware
+          resources of the cluster, such as the amount of physical and virtual
+          memory available for tasks, the number of slots configured on the
+          slaves and the requirements of other processes running on the
+          slaves. If the right values are not set, it is likely that jobs will
+          start failing with memory-related errors or, in the worst case, even
+          affect other tasks or the slaves themselves.
         </p>

-        <p>Note: <code>mapred.{map|reduce}.child.java.opts</code> are used only for
-        configuring the launched child tasks from task tracker. Configuring
-        the memory options for daemons is documented in
-        <a href="cluster_setup.html#Configuring+the+Environment+of+the+Hadoop+Daemons">
-        cluster_setup.html </a></p>
-        <p>The memory available to some parts of the framework is also
-        configurable. In map and reduce tasks, performance may be influenced
-        by adjusting parameters influencing the concurrency of operations and
-        the frequency with which data will hit disk. Monitoring the filesystem
-        counters for a job- particularly relative to byte counts from the map
-        and into the reduce- is invaluable to the tuning of these
-        parameters.</p>
+        <section>
+          <title>Monitoring Task Memory Usage</title>
+          <p>
+            Before describing the memory options, it is useful to look at a
+            feature provided by Hadoop to monitor memory usage of the
+            MapReduce tasks it runs. The basic objective of this feature is to
+            prevent MapReduce tasks from consuming memory beyond a limit that
+            would result in their affecting other processes running on the
+            slave, including other tasks and daemons like the DataNode or
+            TaskTracker.
+          </p>
+          <p>
+            <em>Note:</em> For the time being, this feature is available only
+            on the Linux platform.
+          </p>
+          <p>
+            Hadoop allows monitoring of both the virtual and the physical
+            memory usage of tasks. The two can be monitored independently of
+            each other, and therefore the corresponding options can also be
+            configured independently. It has been found in some environments,
+            particularly related to streaming, that the virtual memory
+            recorded for tasks is high because of libraries loaded by the
+            programs used to run the tasks. However, this memory is largely
+            unused and does not affect the slave's own memory. In such cases,
+            monitoring based on physical memory can provide a more accurate
+            picture of memory usage.
+          </p>
+          <p>
+            This feature assumes that there is a limit on the amount of
+            virtual or physical memory on the slaves that can be used by the
+            running MapReduce tasks. The rest of the memory is assumed to be
+            required for the system and other processes. Since some jobs may
+            require more memory for their tasks than others, Hadoop allows
+            jobs to specify the maximum amount of memory they expect to use.
+            Then, by using resource-aware scheduling and monitoring, Hadoop
+            tries to ensure that at any time only as many tasks are running on
+            the slaves as can meet the dual constraints of an individual job's
+            memory requirements and the total amount of memory available for
+            all MapReduce tasks.
+          </p>
+          <p>
+            The TaskTracker monitors tasks at regular intervals. Each time, it
+            operates in two steps:
+          </p>
+          <ul>
+            <li>
+              In the first step, it checks that a job's task and any child
+              processes it launches are not cumulatively using more virtual or
+              physical memory than specified. If monitoring is enabled for
+              both virtual and physical memory, then virtual memory usage is
+              checked first, followed by physical memory usage. Any task that
+              is found to use more memory is killed along with any child
+              processes it might have launched, and the task status is marked
+              <em>failed</em>. Repeated failures such as this will terminate
+              the job.
+            </li>
+            <li>
+              In the next step, it checks that the cumulative virtual and
+              physical memory used by all running tasks and their child
+              processes does not exceed the total virtual and physical memory
+              limit, respectively. Again, the virtual memory limit is checked
+              first, followed by the physical memory limit. In this case, it
+              kills enough tasks, along with any child processes they might
+              have launched, to bring the cumulative memory usage under the
+              limit. If the virtual memory limit is exceeded, the tasks chosen
+              for killing are the ones that have made the least progress. If
+              the physical memory limit is exceeded, the tasks chosen for
+              killing are the ones that have used the most physical memory.
+              Also, the status of these tasks is marked as <em>killed</em>,
+              and hence repeated occurrences of this will not result in a job
+              failure.
+            </li>
+          </ul>
+          <p>
+            In either case, the task's diagnostic message will indicate the
+            reason why the task was terminated.
+          </p>
+          <p>
+            Resource-aware scheduling can ensure that tasks are scheduled on a
+            slave only if their memory requirements can be satisfied by the
+            slave. The Capacity Scheduler, for example, takes virtual memory
+            requirements into account while scheduling tasks, as described in
+            the section on
+            <a href="ext:capacity-scheduler/MemoryBasedTaskScheduling">
+            memory based scheduling</a>.
+          </p>
+          <p>
+            Memory monitoring is enabled when certain configuration variables
+            are defined with non-zero values, as described below.
+          </p>
         </section>

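The two-step check described in the section above can be summarized in a short, purely illustrative sketch. The class and field names below are hypothetical and do not mirror the actual TaskTracker code; the sketch assumes per-task usage figures have already been collected and shows only the virtual-memory path, the physical-memory path being analogous.

    import java.util.Comparator;
    import java.util.List;

    // Illustrative sketch only: hypothetical names, not the real TaskTracker code.
    class TaskMemoryCheckSketch {

      static class Task {
        long vmemUsed;    // virtual memory used by the task and its children, in bytes
        long vmemLimit;   // per-task limit requested by the job, in bytes
        float progress;   // fraction of the task's work completed
      }

      // Step 1: any task over its own limit is killed and marked failed,
      // so repeated violations count towards the job's failure limit.
      static void checkPerTaskLimits(List<Task> tasks) {
        for (Task t : tasks) {
          if (t.vmemUsed > t.vmemLimit) {
            killAndMarkFailed(t);
          }
        }
      }

      // Step 2: if the tasks together exceed the node-wide limit, kill the
      // least-progressed tasks (marked killed, not failed) until usage fits.
      static void checkCumulativeLimit(List<Task> tasks, long nodeLimit) {
        long total = 0;
        for (Task t : tasks) {
          total += t.vmemUsed;
        }
        tasks.sort(Comparator.comparingDouble(t -> t.progress));
        for (Task t : tasks) {
          if (total <= nodeLimit) {
            break;
          }
          total -= t.vmemUsed;
          killAndMarkKilled(t);
        }
      }

      static void killAndMarkFailed(Task t) { /* kill task and children, mark failed */ }
      static void killAndMarkKilled(Task t) { /* kill task and children, mark killed */ }
    }
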
       <section>
-        <title> Memory monitoring</title>
-        <p>A <code>TaskTracker</code> (TT) can be configured to monitor memory
-        usage of tasks it spawns, so that badly-behaved jobs do not bring
-        down a machine due to excess memory consumption. With monitoring
-        enabled, every task is assigned a task-limit for virtual memory (VMEM).
-        In addition, every node is assigned a node-limit for VMEM usage.
-        A TT ensures that a task is killed if it, and its descendants,
-        use VMEM over the task's per-task limit. It also ensures that one or
-        more tasks are killed if the sum total of VMEM usage by all tasks,
-        and their descendants, crosses the node-limit.</p>
-        <p>Users can, optionally, specify the VMEM task-limit per job. If no
-        such limit is provided, a default limit is used. A node-limit can be
-        set per node.</p>
-        <p>Currently the memory monitoring and management is only supported
-        on the Linux platform.</p>
-        <p>To enable monitoring for a TT, the following parameters all need
-        to be set:</p>
-        <table>
-        <tr><th>Name</th><th>Type</th><th>Description</th></tr>
-        <tr><td>mapred.tasktracker.vmem.reserved</td><td>long</td>
-        <td>A number, in bytes, that represents an offset. The total VMEM on
-        the machine, minus this offset, is the VMEM node-limit for all
-        tasks, and their descendants, spawned by the TT.</td></tr>
-        <tr><td>mapred.task.default.maxvmem</td><td>long</td>
-        <td>A number, in bytes, that represents the default VMEM task-limit
-        associated with a task. Unless overridden by a job's setting,
-        this number defines the VMEM task-limit.</td></tr>
-        <tr><td>mapred.task.limit.maxvmem</td><td>long</td>
-        <td>A number, in bytes, that represents the upper VMEM task-limit
-        associated with a task. Users, when specifying a VMEM task-limit
-        for their tasks, should not specify a limit which exceeds this
-        amount.</td></tr>
-        </table>
-        <p>In addition, the following parameters can also be configured.</p>
-        <table>
-        <tr><th>Name</th><th>Type</th><th>Description</th></tr>
-        <tr><td>mapreduce.tasktracker.taskmemorymanager.monitoringinterval</td>
-        <td>long</td>
-        <td>The time interval, in milliseconds, between which the TT
-        checks for any memory violation. The default value is 5000 msec
-        (5 seconds).</td></tr>
-        </table>
-        <p>Here is how the memory monitoring works for a TT.</p>
-        <ol>
-        <li>If one or more of the configuration parameters described
-        above are missing or -1 is specified, memory monitoring is
-        disabled for the TT.</li>
-        <li>In addition, monitoring is disabled if
-        <code>mapred.task.default.maxvmem</code> is greater than
-        <code>mapred.task.limit.maxvmem</code>.</li>
-        <li>If a TT receives a task whose task-limit is set by the user
-        to a value larger than <code>mapred.task.limit.maxvmem</code>, it
-        logs a warning but executes the task.</li>
-        <li>Periodically, the TT checks the following:
-        <ul>
-        <li>If any task's current VMEM usage is greater than that task's
-        VMEM task-limit, the task is killed and the reason for killing the
-        task is logged in the task diagnostics. Such a task is considered
-        failed, i.e., the killing counts towards the task's failure count.</li>
-        <li>If the sum total of VMEM used by all tasks and descendants is
-        greater than the node-limit, the TT kills enough tasks, in the
-        order of least progress made, till the overall VMEM usage falls
-        below the node-limit. Such killed tasks are not considered failed
-        and their killing does not count towards the tasks' failure counts.</li>
-        </ul></li>
-        </ol>
-        <p>Schedulers can choose to ease the monitoring pressure on the TT by
-        preventing too many tasks from running on a node and by scheduling
-        tasks only if the TT has enough VMEM free. In addition, Schedulers may
-        choose to consider the physical memory (RAM) available on the node
-        as well. To enable Scheduler support, TTs report their memory settings
-        to the JobTracker in every heartbeat. Before getting into details,
-        consider the following additional memory-related parameters that can
-        be configured to enable better scheduling:</p>
-        <table>
-        <tr><th>Name</th><th>Type</th><th>Description</th></tr>
-        <tr><td>mapred.tasktracker.pmem.reserved</td><td>int</td>
-        <td>A number, in bytes, that represents an offset. The total
-        physical memory (RAM) on the machine, minus this offset, is the
-        recommended RAM node-limit. The RAM node-limit is a hint to a
-        Scheduler to schedule only so many tasks that the sum total of
-        their RAM requirements does not exceed this limit.
-        RAM usage is not monitored by a TT.</td></tr>
-        </table>
-        <p>A TT reports the following memory-related numbers in every
-        heartbeat:</p>
-        <ul>
-        <li>The total VMEM available on the node.</li>
-        <li>The value of <code>mapred.tasktracker.vmem.reserved</code>,
-        if set.</li>
-        <li>The total RAM available on the node.</li>
-        <li>The value of <code>mapred.tasktracker.pmem.reserved</code>,
-        if set.</li>
-        </ul>
+        <section>
+          <title>Job Specific Options</title>
+          <p>
+            Memory-related options that can be configured individually per job
+            are described in detail in the section on
+            <a href="ext:mapred-tutorial/ConfiguringMemoryRequirements">
+            Configuring Memory Requirements For A Job</a> in the MapReduce
+            tutorial. While setting up the cluster, the Hadoop defaults for
+            these options can be reviewed and changed to better suit the job
+            profiles expected to run on the cluster, as well as the hardware
+            configuration.
+          </p>
+          <p>
+            As with any other configuration option in Hadoop, if
+            administrators want to prevent users from overriding these options
+            in the jobs they submit, the values can be marked as
+            <em>final</em> in the cluster configuration.
+          </p>
         </section>

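For illustration, a user job could request task memory through the options referenced above before submission. The snippet below is a minimal sketch using the generic Configuration API; the 1024 MB and 2048 MB values are arbitrary examples, and the cluster-level maximums described in the next section still apply.

    import org.apache.hadoop.conf.Configuration;

    // Minimal sketch: request 1 GB per map task and 2 GB per reduce task.
    // The values are examples only; jobs asking for more than the cluster's
    // mapreduce.jobtracker.max{map|reduce}memory.mb limits will be rejected.
    public class JobMemoryRequest {
      public static Configuration newJobConf() {
        Configuration conf = new Configuration();
        conf.setLong("mapreduce.map.memory.mb", 1024L);
        conf.setLong("mapreduce.reduce.memory.mb", 2048L);
        return conf;
      }
    }
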
+        <section>
+          <title>Cluster Specific Options</title>
+          <p>
+            This section describes the memory-related options that are used by
+            the JobTracker and TaskTrackers and cannot be changed by jobs. The
+            values set for these options should be the same on all the slave
+            nodes in a cluster.
+          </p>
+          <ul>
+            <li>
+              <code>mapreduce.cluster.{map|reduce}memory.mb</code>: These
+              options define the default amount of virtual memory that should
+              be allocated for MapReduce tasks running in the cluster. They
+              typically match the default values set for the options
+              <code>mapreduce.{map|reduce}.memory.mb</code>. They help in
+              calculating the total amount of virtual memory available for
+              MapReduce tasks on a slave, using the following equation:<br/>
+              <em>Total virtual memory for all MapReduce tasks =
+              (mapreduce.cluster.mapmemory.mb *
+              mapreduce.tasktracker.map.tasks.maximum) +
+              (mapreduce.cluster.reducememory.mb *
+              mapreduce.tasktracker.reduce.tasks.maximum)</em><br/>
+              Typically, reduce tasks require more memory than map tasks, so a
+              higher value is recommended for
+              <em>mapreduce.cluster.reducememory.mb</em>. The value is
+              specified in MB. To set a value of 2GB for reduce tasks, set
+              <em>mapreduce.cluster.reducememory.mb</em> to 2048.
+            </li>
+            <li>
+              <code>mapreduce.jobtracker.max{map|reduce}memory.mb</code>:
+              These options define the maximum amount of virtual memory that
+              can be requested by jobs using the parameters
+              <code>mapreduce.{map|reduce}.memory.mb</code>. The system will
+              reject any job that is submitted requesting more memory than
+              these limits. Typically, the values for these options should be
+              set to satisfy the following constraint:<br/>
+              <em>mapreduce.jobtracker.maxmapmemory.mb =
+              mapreduce.cluster.mapmemory.mb *
+              mapreduce.tasktracker.map.tasks.maximum<br/>
+              mapreduce.jobtracker.maxreducememory.mb =
+              mapreduce.cluster.reducememory.mb *
+              mapreduce.tasktracker.reduce.tasks.maximum</em><br/>
+              The value is specified in MB. If
+              <code>mapreduce.cluster.reducememory.mb</code> is set to 2GB and
+              there are 2 reduce slots configured on the slaves, the value for
+              <code>mapreduce.jobtracker.maxreducememory.mb</code> should be
+              set to 4096.
+            </li>
+            <li>
+              <code>mapreduce.tasktracker.reserved.physicalmemory.mb</code>:
+              This option defines the amount of physical memory that is
+              reserved for system and daemon processes. Using this, the amount
+              of physical memory available for MapReduce tasks is calculated
+              using the following equation:<br/>
+              <em>Total physical memory for all MapReduce tasks =
+              Total physical memory available on the system -
+              mapreduce.tasktracker.reserved.physicalmemory.mb</em><br/>
+              The value is specified in MB. To set this value to 2GB, specify
+              the value as 2048.
+            </li>
+            <li>
+              <code>mapreduce.tasktracker.taskmemorymanager.monitoringinterval</code>:
+              This option defines the time the TaskTracker waits between two
+              cycles of memory monitoring. The value is specified in
+              milliseconds.
+            </li>
+          </ul>
+          <p>
+            <em>Note:</em> The virtual memory monitoring function is only
+            enabled if the variables
+            <code>mapreduce.cluster.{map|reduce}memory.mb</code> and
+            <code>mapreduce.jobtracker.max{map|reduce}memory.mb</code> are set
+            to values greater than zero. Likewise, the physical memory
+            monitoring function is only enabled if the variable
+            <code>mapreduce.tasktracker.reserved.physicalmemory.mb</code> is
+            set to a value greater than zero.
+          </p>
+        </section>
+      </section>

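To make the arithmetic in the list above concrete, the sketch below works through the stated equations for a hypothetical slave with 2 map slots, 2 reduce slots and 8 GB of RAM; all numbers are illustrative, not recommended defaults.

    // Worked example of the equations above for a hypothetical slave.
    // All values are in MB and purely illustrative.
    public class ClusterMemoryMath {
      public static void main(String[] args) {
        long clusterMapMemoryMb = 1024;     // mapreduce.cluster.mapmemory.mb
        long clusterReduceMemoryMb = 2048;  // mapreduce.cluster.reducememory.mb
        int mapSlots = 2;                   // mapreduce.tasktracker.map.tasks.maximum
        int reduceSlots = 2;                // mapreduce.tasktracker.reduce.tasks.maximum

        // Total virtual memory for all MapReduce tasks on the slave.
        long totalVmemForTasks = clusterMapMemoryMb * mapSlots
            + clusterReduceMemoryMb * reduceSlots;                    // 6144 MB

        // JobTracker limits that satisfy the constraint above.
        long maxMapMemoryMb = clusterMapMemoryMb * mapSlots;          // 2048 MB
        long maxReduceMemoryMb = clusterReduceMemoryMb * reduceSlots; // 4096 MB

        // Physical memory left for tasks on an 8 GB slave that reserves 2 GB
        // via mapreduce.tasktracker.reserved.physicalmemory.mb.
        long totalPhysicalMb = 8192;
        long reservedPhysicalMb = 2048;
        long physicalForTasks = totalPhysicalMb - reservedPhysicalMb; // 6144 MB

        System.out.println("Virtual memory for tasks: " + totalVmemForTasks + " MB");
        System.out.println("mapreduce.jobtracker.maxmapmemory.mb: " + maxMapMemoryMb);
        System.out.println("mapreduce.jobtracker.maxreducememory.mb: " + maxReduceMemoryMb);
        System.out.println("Physical memory for tasks: " + physicalForTasks + " MB");
      }
    }
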
       <section>
         <title>Task Controllers</title>
         <p>Task controllers are classes in the Hadoop Map/Reduce
@@ -72,9 +72,12 @@ See http://forrest.apache.org/docs/linking.html for more info.
     <mapred-default href="http://hadoop.apache.org/mapreduce/docs/current/mapred-default.html" />

     <mapred-queues href="http://hadoop.apache.org/mapreduce/docs/current/mapred_queues.xml" />
-    <capacity-scheduler href="http://hadoop.apache.org/mapreduce/docs/current/capacity_scheduler.html" />
+    <capacity-scheduler href="http://hadoop.apache.org/mapreduce/docs/current/capacity_scheduler.html">
+      <MemoryBasedTaskScheduling href="#Scheduling+Tasks+Considering+Memory+Requirements" />
+    </capacity-scheduler>
     <mapred-tutorial href="http://hadoop.apache.org/mapreduce/docs/current/mapred_tutorial.html" >
       <JobAuthorization href="#Job+Authorization" />
+      <ConfiguringMemoryRequirements href="#Configuring+Memory+Requirements+For+A+Job" />
     </mapred-tutorial>
     <streaming href="http://hadoop.apache.org/mapreduce/docs/current/streaming.html" />
     <distcp href="http://hadoop.apache.org/mapreduce/docs/current/distcp.html" />