YARN-2632. Document NM Restart feature. Contributed by Junping Du and Vinod Kumar Vavilapalli
(cherry picked from commit 1e215e8ba2
)
This commit is contained in:
parent
e5c0fa2c95
commit
dcc1ab15ab
|
@ -124,6 +124,7 @@
|
|||
<item name="Writing YARN Applications" href="hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html"/>
|
||||
<item name="YARN Commands" href="hadoop-yarn/hadoop-yarn-site/YarnCommands.html"/>
|
||||
<item name="Scheduler Load Simulator" href="hadoop-sls/SchedulerLoadSimulator.html"/>
|
||||
<item name="NodeManager Restart" href="hadoop-yarn/hadoop-yarn-site/NodeManagerRestart.html"/>
|
||||
</menu>
|
||||
|
||||
<menu name="YARN REST APIs" inherit="top">
|
||||
|
|
|
@ -123,6 +123,9 @@ Release 2.6.0 - UNRELEASED
|
|||
YARN-2647. Added a queue CLI for getting queue information. (Sunil Govind via
|
||||
vinodkv)
|
||||
|
||||
YARN-2632. Document NM Restart feature. (Junping Du and Vinod Kumar
|
||||
Vavilapalli via jlowe)
|
||||
|
||||
IMPROVEMENTS
|
||||
|
||||
YARN-2242. Improve exception information on AM launch crashes. (Li Lu
|
||||
|
|
|
@ -0,0 +1,86 @@
|
|||
~~ Licensed under the Apache License, Version 2.0 (the "License");
|
||||
~~ you may not use this file except in compliance with the License.
|
||||
~~ You may obtain a copy of the License at
|
||||
~~
|
||||
~~ http://www.apache.org/licenses/LICENSE-2.0
|
||||
~~
|
||||
~~ Unless required by applicable law or agreed to in writing, software
|
||||
~~ distributed under the License is distributed on an "AS IS" BASIS,
|
||||
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
~~ See the License for the specific language governing permissions and
|
||||
~~ limitations under the License. See accompanying LICENSE file.
|
||||
|
||||
---
|
||||
NodeManager Restart
|
||||
---
|
||||
---
|
||||
${maven.build.timestamp}
|
||||
|
||||
NodeManager Restart
|
||||
|
||||
* Introduction
|
||||
|
||||
This document gives an overview of NodeManager (NM) restart, a feature that
|
||||
enables the NodeManager to be restarted without losing
|
||||
the active containers running on the node. At a high level, the NM stores any
|
||||
necessary state to a local state-store as it processes container-management
|
||||
requests. When the NM restarts, it recovers by first loading state for
|
||||
various subsystems and then letting those subsystems perform recovery using
|
||||
the loaded state.
|
||||
|
||||
* Enabling NM Restart
|
||||
|
||||
[[1]] To enable NM Restart functionality, set the following property in <<conf/yarn-site.xml>> to true:
|
||||
|
||||
*--------------------------------------+--------------------------------------+
|
||||
|| Property || Value |
|
||||
*--------------------------------------+--------------------------------------+
|
||||
| <<<yarn.nodemanager.recovery.enabled>>> | |
|
||||
| | <<<true>>>, (default value is set to false) |
|
||||
*--------------------------------------+--------------------------------------+
|
||||
|
||||
[[2]] Configure a path to the local file-system directory where the
|
||||
NodeManager can save its run state
|
||||
|
||||
*--------------------------------------+--------------------------------------+
|
||||
|| Property || Description |
|
||||
*--------------------------------------+--------------------------------------+
|
||||
| <<<yarn.nodemanager.recovery.dir>>> | |
|
||||
| | The local filesystem directory in which the node manager will store state |
|
||||
| | when recovery is enabled. |
|
||||
| | The default value is set to |
|
||||
| | <<<${hadoop.tmp.dir}/yarn-nm-recovery>>>. |
|
||||
*--------------------------------------+--------------------------------------+
|
||||
|
||||
[[3]] Configure a valid RPC address for the NodeManager
|
||||
|
||||
*--------------------------------------+--------------------------------------+
|
||||
|| Property || Description |
|
||||
*--------------------------------------+--------------------------------------+
|
||||
| <<<yarn.nodemanager.address>>> | |
|
||||
| | Ephemeral ports (port 0, which is default) cannot be used for the |
|
||||
| | NodeManager's RPC server specified via yarn.nodemanager.address as it can |
|
||||
| | make NM use different ports before and after a restart. This will break any |
|
||||
| | previously running clients that were communicating with the NM before |
|
||||
| | restart. Explicitly setting yarn.nodemanager.address to an address with |
|
||||
| | specific port number (for e.g 0.0.0.0:45454) is a precondition for enabling |
|
||||
| | NM restart. |
|
||||
*--------------------------------------+--------------------------------------+
|
||||
|
||||
[[4]] Auxiliary services
|
||||
|
||||
NodeManagers in a YARN cluster can be configured to run auxiliary services.
|
||||
For a completely functional NM restart, YARN relies on any auxiliary service
|
||||
configured to also support recovery. This usually includes (1) avoiding usage
|
||||
of ephemeral ports so that previously running clients (in this case, usually
|
||||
containers) are not disrupted after restart and (2) having the auxiliary
|
||||
service itself support recoverability by reloading any previous state when
|
||||
NodeManager restarts and reinitializes the auxiliary service.
|
||||
|
||||
A simple example for the above is the auxiliary service 'ShuffleHandler' for
|
||||
MapReduce (MR). ShuffleHandler respects the above two requirements already,
|
||||
so users/admins don't have do anything for it to support NM restart: (1) The
|
||||
configuration property <<mapreduce.shuffle.port>> controls which port the
|
||||
ShuffleHandler on a NodeManager host binds to, and it defaults to a
|
||||
non-ephemeral port. (2) The ShuffleHandler service also already supports
|
||||
recovery of previous state after NM restarts.
|
Loading…
Reference in New Issue