YARN-2632. Document NM Restart feature. Contributed by Junping Du and Vinod Kumar Vavilapalli

(cherry picked from commit 1e215e8ba2)
This commit is contained in:
Jason Lowe 2014-11-07 23:40:22 +00:00
parent e5c0fa2c95
commit dcc1ab15ab
3 changed files with 90 additions and 0 deletions

View File

@ -124,6 +124,7 @@
<item name="Writing YARN Applications" href="hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html"/> <item name="Writing YARN Applications" href="hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html"/>
<item name="YARN Commands" href="hadoop-yarn/hadoop-yarn-site/YarnCommands.html"/> <item name="YARN Commands" href="hadoop-yarn/hadoop-yarn-site/YarnCommands.html"/>
<item name="Scheduler Load Simulator" href="hadoop-sls/SchedulerLoadSimulator.html"/> <item name="Scheduler Load Simulator" href="hadoop-sls/SchedulerLoadSimulator.html"/>
<item name="NodeManager Restart" href="hadoop-yarn/hadoop-yarn-site/NodeManagerRestart.html"/>
</menu> </menu>
<menu name="YARN REST APIs" inherit="top"> <menu name="YARN REST APIs" inherit="top">

View File

@ -123,6 +123,9 @@ Release 2.6.0 - UNRELEASED
YARN-2647. Added a queue CLI for getting queue information. (Sunil Govind via YARN-2647. Added a queue CLI for getting queue information. (Sunil Govind via
vinodkv) vinodkv)
YARN-2632. Document NM Restart feature. (Junping Du and Vinod Kumar
Vavilapalli via jlowe)
IMPROVEMENTS IMPROVEMENTS
YARN-2242. Improve exception information on AM launch crashes. (Li Lu YARN-2242. Improve exception information on AM launch crashes. (Li Lu

View File

@ -0,0 +1,86 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
NodeManager Restart
---
---
${maven.build.timestamp}
NodeManager Restart
* Introduction
This document gives an overview of NodeManager (NM) restart, a feature that
enables the NodeManager to be restarted without losing
the active containers running on the node. At a high level, the NM stores any
necessary state to a local state-store as it processes container-management
requests. When the NM restarts, it recovers by first loading state for
various subsystems and then letting those subsystems perform recovery using
the loaded state.
* Enabling NM Restart
[[1]] To enable NM Restart functionality, set the following property in <<conf/yarn-site.xml>> to true:
*--------------------------------------+--------------------------------------+
|| Property || Value |
*--------------------------------------+--------------------------------------+
| <<<yarn.nodemanager.recovery.enabled>>> | |
| | <<<true>>>, (default value is set to false) |
*--------------------------------------+--------------------------------------+
[[2]] Configure a path to the local file-system directory where the
NodeManager can save its run state
*--------------------------------------+--------------------------------------+
|| Property || Description |
*--------------------------------------+--------------------------------------+
| <<<yarn.nodemanager.recovery.dir>>> | |
| | The local filesystem directory in which the node manager will store state |
| | when recovery is enabled. |
| | The default value is set to |
| | <<<${hadoop.tmp.dir}/yarn-nm-recovery>>>. |
*--------------------------------------+--------------------------------------+
[[3]] Configure a valid RPC address for the NodeManager
*--------------------------------------+--------------------------------------+
|| Property || Description |
*--------------------------------------+--------------------------------------+
| <<<yarn.nodemanager.address>>> | |
| | Ephemeral ports (port 0, which is default) cannot be used for the |
| | NodeManager's RPC server specified via yarn.nodemanager.address as it can |
| | make NM use different ports before and after a restart. This will break any |
| | previously running clients that were communicating with the NM before |
| | restart. Explicitly setting yarn.nodemanager.address to an address with |
| | specific port number (for e.g 0.0.0.0:45454) is a precondition for enabling |
| | NM restart. |
*--------------------------------------+--------------------------------------+
[[4]] Auxiliary services
NodeManagers in a YARN cluster can be configured to run auxiliary services.
For a completely functional NM restart, YARN relies on any auxiliary service
configured to also support recovery. This usually includes (1) avoiding usage
of ephemeral ports so that previously running clients (in this case, usually
containers) are not disrupted after restart and (2) having the auxiliary
service itself support recoverability by reloading any previous state when
NodeManager restarts and reinitializes the auxiliary service.
A simple example for the above is the auxiliary service 'ShuffleHandler' for
MapReduce (MR). ShuffleHandler respects the above two requirements already,
so users/admins don't have do anything for it to support NM restart: (1) The
configuration property <<mapreduce.shuffle.port>> controls which port the
ShuffleHandler on a NodeManager host binds to, and it defaults to a
non-ephemeral port. (2) The ShuffleHandler service also already supports
recovery of previous state after NM restarts.