From dcc1ab15ab4276a36fdb4f5376e65880ed8349ec Mon Sep 17 00:00:00 2001
From: Jason Lowe
Date: Fri, 7 Nov 2014 23:40:22 +0000
Subject: [PATCH] YARN-2632. Document NM Restart feature. Contributed by Junping Du and Vinod Kumar Vavilapalli

(cherry picked from commit 1e215e8ba2e801eb26f16c307daee756d6b2ca66)
---
 hadoop-project/src/site/site.xml                   |  1 +
 hadoop-yarn-project/CHANGES.txt                    |  3 +
 .../src/site/apt/NodeManagerRestart.apt.vm         | 86 +++++++++++++++++++
 3 files changed, 90 insertions(+)
 create mode 100644 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/NodeManagerRestart.apt.vm

diff --git a/hadoop-project/src/site/site.xml b/hadoop-project/src/site/site.xml
index 2fd15321d89..4a2c221b753 100644
--- a/hadoop-project/src/site/site.xml
+++ b/hadoop-project/src/site/site.xml
@@ -124,6 +124,7 @@
+
diff --git a/hadoop-yarn-project/CHANGES.txt b/hadoop-yarn-project/CHANGES.txt
index 9e09317deb3..12c9af9e82d 100644
--- a/hadoop-yarn-project/CHANGES.txt
+++ b/hadoop-yarn-project/CHANGES.txt
@@ -123,6 +123,9 @@ Release 2.6.0 - UNRELEASED
     YARN-2647. Added a queue CLI for getting queue information.
     (Sunil Govind via vinodkv)
 
+    YARN-2632. Document NM Restart feature. (Junping Du and Vinod Kumar
+    Vavilapalli via jlowe)
+
   IMPROVEMENTS
 
     YARN-2242. Improve exception information on AM launch crashes. (Li Lu
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/NodeManagerRestart.apt.vm b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/NodeManagerRestart.apt.vm
new file mode 100644
index 00000000000..ba03f4e8f6e
--- /dev/null
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/NodeManagerRestart.apt.vm
@@ -0,0 +1,86 @@
+~~ Licensed under the Apache License, Version 2.0 (the "License");
+~~ you may not use this file except in compliance with the License.
+~~ You may obtain a copy of the License at
+~~
+~~   http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License. See accompanying LICENSE file.
+
+  ---
+  NodeManager Restart
+  ---
+  ---
+  ${maven.build.timestamp}
+
+NodeManager Restart
+
+* Introduction
+
+  This document gives an overview of NodeManager (NM) restart, a feature that
+  enables the NodeManager to be restarted without losing
+  the active containers running on the node. At a high level, the NM stores any
+  necessary state to a local state-store as it processes container-management
+  requests. When the NM restarts, it recovers by first loading state for
+  various subsystems and then letting those subsystems perform recovery using
+  the loaded state.
+
+* Enabling NM Restart
+
+  [[1]] To enable NM Restart functionality, set the following property in <> to true:
+
+*--------------------------------------+--------------------------------------+
+|| Property                            || Value                                |
+*--------------------------------------+--------------------------------------+
+| <<>> | |
+| | <<>>, (default value is set to false) |
+*--------------------------------------+--------------------------------------+
+
+  [[2]] Configure a path to the local file-system directory where the
+  NodeManager can save its run state
+
+*--------------------------------------+--------------------------------------+
+|| Property                            || Description                          |
+*--------------------------------------+--------------------------------------+
+| <<>> | |
+| | The local filesystem directory in which the node manager will store state |
+| | when recovery is enabled. |
+| | The default value is set to |
+| | <<<${hadoop.tmp.dir}/yarn-nm-recovery>>>. |
+*--------------------------------------+--------------------------------------+
+
+  [[3]] Configure a valid RPC address for the NodeManager
+
+*--------------------------------------+--------------------------------------+
+|| Property                            || Description                          |
+*--------------------------------------+--------------------------------------+
+| <<>> | |
+| | Ephemeral ports (port 0, which is the default) cannot be used for the |
+| | NodeManager's RPC server specified via yarn.nodemanager.address, because |
+| | the NM could then use different ports before and after a restart. This |
+| | will break any previously running clients that were communicating with |
+| | the NM before restart. Explicitly setting yarn.nodemanager.address to an |
+| | address with a specific port number (e.g. 0.0.0.0:45454) is a |
+| | precondition for enabling NM restart. |
+*--------------------------------------+--------------------------------------+
+
+  [[4]] Auxiliary services
+
+  NodeManagers in a YARN cluster can be configured to run auxiliary services.
+  For a completely functional NM restart, YARN relies on any auxiliary service
+  configured to also support recovery. This usually includes (1) avoiding usage
+  of ephemeral ports so that previously running clients (in this case, usually
+  containers) are not disrupted after restart and (2) having the auxiliary
+  service itself support recoverability by reloading any previous state when
+  NodeManager restarts and reinitializes the auxiliary service.
+
+  A simple example for the above is the auxiliary service 'ShuffleHandler' for
+  MapReduce (MR). ShuffleHandler respects the above two requirements already,
+  so users/admins don't have to do anything for it to support NM restart: (1) The
+  configuration property <> controls which port the
+  ShuffleHandler on a NodeManager host binds to, and it defaults to a
+  non-ephemeral port. (2) The ShuffleHandler service also already supports
+  recovery of previous state after NM restarts.
\ No newline at end of file
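
Reviewer note (not part of the patch): the preconditions in steps [[1]] and [[3]] of the new page can be checked mechanically against a yarn-site.xml before NM restart is rolled out. The Python sketch below is only an illustration under stated assumptions. `load_properties` and `check_nm_restart` are hypothetical helper names. `yarn.nodemanager.address` comes from the text of the page; `yarn.nodemanager.recovery.enabled` is assumed from Hadoop's yarn-default.xml, because the property name is not legible in the table of this copy of the patch. An unset address is treated as port 0, following the statement that port 0 is the default.

```python
import xml.etree.ElementTree as ET

# yarn.nodemanager.address is named in the documented text.
# yarn.nodemanager.recovery.enabled is an assumption taken from
# yarn-default.xml; the name is not legible in the patch's table.
RECOVERY_ENABLED = "yarn.nodemanager.recovery.enabled"
NM_ADDRESS = "yarn.nodemanager.address"


def load_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def check_nm_restart(props):
    """Return a list of problems that would prevent a clean NM restart."""
    problems = []
    if props.get(RECOVERY_ENABLED, "false").lower() != "true":
        problems.append(RECOVERY_ENABLED + " is not set to true")
    # An unset address is treated as an ephemeral port (port 0).
    address = props.get(NM_ADDRESS, "0.0.0.0:0")
    port = address.rpartition(":")[2]
    if not port.isdigit() or int(port) == 0:
        problems.append(NM_ADDRESS + " must use a fixed, non-zero port")
    return problems


SAMPLE = """<configuration>
  <property>
    <name>yarn.nodemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.address</name>
    <value>0.0.0.0:45454</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for problem in check_nm_restart(load_properties(SAMPLE)):
        print(problem)
```

The sample configuration uses the example address from step [[3]], so the check reports no problems for it. The recovery directory from step [[2]] is deliberately not checked, since it has a usable default.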