YARN-2632. Document NM Restart feature. Contributed by Junping Du and Vinod Kumar Vavilapalli
(cherry picked from commit 1e215e8ba2
)
This commit is contained in:
parent
e5c0fa2c95
commit
dcc1ab15ab
|
@ -124,6 +124,7 @@
|
||||||
<item name="Writing YARN Applications" href="hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html"/>
|
<item name="Writing YARN Applications" href="hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html"/>
|
||||||
<item name="YARN Commands" href="hadoop-yarn/hadoop-yarn-site/YarnCommands.html"/>
|
<item name="YARN Commands" href="hadoop-yarn/hadoop-yarn-site/YarnCommands.html"/>
|
||||||
<item name="Scheduler Load Simulator" href="hadoop-sls/SchedulerLoadSimulator.html"/>
|
<item name="Scheduler Load Simulator" href="hadoop-sls/SchedulerLoadSimulator.html"/>
|
||||||
|
<item name="NodeManager Restart" href="hadoop-yarn/hadoop-yarn-site/NodeManagerRestart.html"/>
|
||||||
</menu>
|
</menu>
|
||||||
|
|
||||||
<menu name="YARN REST APIs" inherit="top">
|
<menu name="YARN REST APIs" inherit="top">
|
||||||
|
|
|
@ -123,6 +123,9 @@ Release 2.6.0 - UNRELEASED
|
||||||
YARN-2647. Added a queue CLI for getting queue information. (Sunil Govind via
|
YARN-2647. Added a queue CLI for getting queue information. (Sunil Govind via
|
||||||
vinodkv)
|
vinodkv)
|
||||||
|
|
||||||
|
YARN-2632. Document NM Restart feature. (Junping Du and Vinod Kumar
|
||||||
|
Vavilapalli via jlowe)
|
||||||
|
|
||||||
IMPROVEMENTS
|
IMPROVEMENTS
|
||||||
|
|
||||||
YARN-2242. Improve exception information on AM launch crashes. (Li Lu
|
YARN-2242. Improve exception information on AM launch crashes. (Li Lu
|
||||||
|
|
|
@ -0,0 +1,86 @@
|
||||||
|
~~ Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
~~ you may not use this file except in compliance with the License.
|
||||||
|
~~ You may obtain a copy of the License at
|
||||||
|
~~
|
||||||
|
~~ http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
~~
|
||||||
|
~~ Unless required by applicable law or agreed to in writing, software
|
||||||
|
~~ distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
~~ See the License for the specific language governing permissions and
|
||||||
|
~~ limitations under the License. See accompanying LICENSE file.
|
||||||
|
|
||||||
|
---
|
||||||
|
NodeManager Restart
|
||||||
|
---
|
||||||
|
---
|
||||||
|
${maven.build.timestamp}
|
||||||
|
|
||||||
|
NodeManager Restart
|
||||||
|
|
||||||
|
* Introduction
|
||||||
|
|
||||||
|
This document gives an overview of NodeManager (NM) restart, a feature that
|
||||||
|
enables the NodeManager to be restarted without losing
|
||||||
|
the active containers running on the node. At a high level, the NM stores any
|
||||||
|
necessary state to a local state-store as it processes container-management
|
||||||
|
requests. When the NM restarts, it recovers by first loading state for
|
||||||
|
various subsystems and then letting those subsystems perform recovery using
|
||||||
|
the loaded state.
|
||||||
|
|
||||||
|
* Enabling NM Restart
|
||||||
|
|
||||||
|
[[1]] To enable NM Restart functionality, set the following property in <<conf/yarn-site.xml>> to true:
|
||||||
|
|
||||||
|
*--------------------------------------+--------------------------------------+
|
||||||
|
|| Property || Value |
|
||||||
|
*--------------------------------------+--------------------------------------+
|
||||||
|
| <<<yarn.nodemanager.recovery.enabled>>> | |
|
||||||
|
| | <<<true>>>, (default value is set to false) |
|
||||||
|
*--------------------------------------+--------------------------------------+
|
||||||
|
|
||||||
|
[[2]] Configure a path to the local file-system directory where the
|
||||||
|
NodeManager can save its run state
|
||||||
|
|
||||||
|
*--------------------------------------+--------------------------------------+
|
||||||
|
|| Property || Description |
|
||||||
|
*--------------------------------------+--------------------------------------+
|
||||||
|
| <<<yarn.nodemanager.recovery.dir>>> | |
|
||||||
|
| | The local filesystem directory in which the node manager will store state |
|
||||||
|
| | when recovery is enabled. |
|
||||||
|
| | The default value is set to |
|
||||||
|
| | <<<${hadoop.tmp.dir}/yarn-nm-recovery>>>. |
|
||||||
|
*--------------------------------------+--------------------------------------+
|
||||||
|
|
||||||
|
[[3]] Configure a valid RPC address for the NodeManager
|
||||||
|
|
||||||
|
*--------------------------------------+--------------------------------------+
|
||||||
|
|| Property || Description |
|
||||||
|
*--------------------------------------+--------------------------------------+
|
||||||
|
| <<<yarn.nodemanager.address>>> | |
|
||||||
|
| | Ephemeral ports (port 0, which is default) cannot be used for the |
|
||||||
|
| | NodeManager's RPC server specified via yarn.nodemanager.address as it can |
|
||||||
|
| | make NM use different ports before and after a restart. This will break any |
|
||||||
|
| | previously running clients that were communicating with the NM before |
|
||||||
|
| | restart. Explicitly setting yarn.nodemanager.address to an address with |
|
||||||
|
| | specific port number (for e.g 0.0.0.0:45454) is a precondition for enabling |
|
||||||
|
| | NM restart. |
|
||||||
|
*--------------------------------------+--------------------------------------+
|
||||||
|
|
||||||
|
[[4]] Auxiliary services
|
||||||
|
|
||||||
|
NodeManagers in a YARN cluster can be configured to run auxiliary services.
|
||||||
|
For a completely functional NM restart, YARN relies on any auxiliary service
|
||||||
|
configured to also support recovery. This usually includes (1) avoiding usage
|
||||||
|
of ephemeral ports so that previously running clients (in this case, usually
|
||||||
|
containers) are not disrupted after restart and (2) having the auxiliary
|
||||||
|
service itself support recoverability by reloading any previous state when
|
||||||
|
NodeManager restarts and reinitializes the auxiliary service.
|
||||||
|
|
||||||
|
A simple example for the above is the auxiliary service 'ShuffleHandler' for
|
||||||
|
MapReduce (MR). ShuffleHandler respects the above two requirements already,
|
||||||
|
so users/admins don't have do anything for it to support NM restart: (1) The
|
||||||
|
configuration property <<mapreduce.shuffle.port>> controls which port the
|
||||||
|
ShuffleHandler on a NodeManager host binds to, and it defaults to a
|
||||||
|
non-ephemeral port. (2) The ShuffleHandler service also already supports
|
||||||
|
recovery of previous state after NM restarts.
|
Loading…
Reference in New Issue