YARN-3168. Convert site documentation from apt to markdown (Gururaj Shetty via aw)

Allen Wittenauer 2015-02-27 20:39:44 -08:00
parent edcecedc1c
commit 2e44b75f72
36 changed files with 5763 additions and 7502 deletions

@@ -20,6 +20,9 @@ Trunk - Unreleased
YARN-2980. Move health check script related functionality to hadoop-common
(Varun Saxena via aw)
YARN-3168. Convert site documentation from apt to markdown (Gururaj Shetty
via aw)
OPTIMIZATIONS
BUG FIXES

@@ -1,368 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Map Reduce Next Generation-${project.version} - Capacity Scheduler
---
---
${maven.build.timestamp}
Hadoop MapReduce Next Generation - Capacity Scheduler
%{toc|section=1|fromDepth=0}
* {Purpose}
This document describes the <<<CapacityScheduler>>>, a pluggable scheduler
for Hadoop which allows multiple tenants to securely share a large cluster
such that their applications are allocated resources in a timely manner under
constraints of allocated capacities.
* {Overview}
The <<<CapacityScheduler>>> is designed to run Hadoop applications as a
shared, multi-tenant cluster in an operator-friendly manner while maximizing
the throughput and the utilization of the cluster.
Traditionally each organization has its own private set of compute resources
that have sufficient capacity to meet the organization's SLA under peak or
near peak conditions. This generally leads to poor average utilization and
the overhead of managing multiple independent clusters, one per organization.
Sharing clusters between organizations is a cost-effective manner of running
large Hadoop installations since this allows them to reap benefits of
economies of scale without creating private clusters. However, organizations
are concerned about sharing a cluster because they are worried about others
using the resources that are critical for their SLAs.
The <<<CapacityScheduler>>> is designed to allow sharing a large cluster while
giving each organization capacity guarantees. The central idea is
that the available resources in the Hadoop cluster are shared among multiple
organizations who collectively fund the cluster based on their computing
needs. There is an added benefit that an organization can access
any excess capacity not being used by others. This provides elasticity for
the organizations in a cost-effective manner.
Sharing clusters across organizations necessitates strong support for
multi-tenancy since each organization must be guaranteed capacity and
safe-guards to ensure the shared cluster is impervious to a single rogue
application or user or sets thereof. The <<<CapacityScheduler>>> provides a
stringent set of limits to ensure that a single application or user or queue
cannot consume a disproportionate amount of resources in the cluster. Also, the
<<<CapacityScheduler>>> provides limits on initialized/pending applications
from a single user and queue to ensure fairness and stability of the cluster.
The primary abstraction provided by the <<<CapacityScheduler>>> is the concept
of <queues>. These queues are typically set up by administrators to reflect the
economics of the shared cluster.
To provide further control and predictability on sharing of resources, the
<<<CapacityScheduler>>> supports <hierarchical queues> to ensure
resources are shared among the sub-queues of an organization before other
queues are allowed to use free resources, thereby providing <affinity>
for sharing free resources among applications of a given organization.
* {Features}
The <<<CapacityScheduler>>> supports the following features:
* Hierarchical Queues - Hierarchy of queues is supported to ensure resources
are shared among the sub-queues of an organization before other
queues are allowed to use free resources, thereby providing more control
and predictability.
* Capacity Guarantees - Queues are allocated a fraction of the capacity of the
grid in the sense that a certain capacity of resources will be at their
disposal. All applications submitted to a queue will have access to the
capacity allocated to the queue. Administrators can configure soft limits and
optional hard limits on the capacity allocated to each queue.
* Security - Each queue has strict ACLs which control which users can submit
applications to individual queues. Also, there are safe-guards to ensure
that users cannot view and/or modify applications from other users.
Also, per-queue and system administrator roles are supported.
* Elasticity - Free resources can be allocated to any queue beyond its
capacity. When there is demand for these resources from queues running below
capacity at a future point in time, as tasks scheduled on these resources
complete, they will be assigned to applications on queues running below the
capacity (pre-emption is not supported). This ensures that resources are available
in a predictable and elastic manner to queues, thus preventing artificial silos
of resources in the cluster which helps utilization.
* Multi-tenancy - A comprehensive set of limits is provided to prevent a
single application, user and queue from monopolizing resources of the queue
or the cluster as a whole to ensure that the cluster isn't overwhelmed.
* Operability
* Runtime Configuration - The queue definitions and properties such as
capacity and ACLs can be changed at runtime by administrators in a secure
manner to minimize disruption to users. Also, a console is provided for
users and administrators to view current allocation of resources to
various queues in the system. Administrators can <add additional queues>
at runtime, but queues cannot be <deleted> at runtime.
* Drain applications - Administrators can <stop> queues
at runtime to ensure that while existing applications run to completion,
no new applications can be submitted. If a queue is in <<<STOPPED>>>
state, new applications cannot be submitted to <itself> or
<any of its child queues>. Existing applications continue to completion,
thus the queue can be <drained> gracefully. Administrators can also
<start> the stopped queues.
* Resource-based Scheduling - Support for resource-intensive applications,
wherein an application can optionally specify higher resource requirements
than the default, thereby accommodating applications with differing resource
requirements. Currently, <memory> is the resource requirement supported.
[]
* {Configuration}
* Setting up <<<ResourceManager>>> to use <<<CapacityScheduler>>>
To configure the <<<ResourceManager>>> to use the <<<CapacityScheduler>>>, set
the following property in the <<conf/yarn-site.xml>>:
*--------------------------------------+--------------------------------------+
|| Property || Value |
*--------------------------------------+--------------------------------------+
| <<<yarn.resourcemanager.scheduler.class>>> | |
| | <<<org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler>>> |
*--------------------------------------+--------------------------------------+
* Setting up <queues>
<<conf/capacity-scheduler.xml>> is the configuration file for the
<<<CapacityScheduler>>>.
The <<<CapacityScheduler>>> has a pre-defined queue called <root>. All
queues in the system are children of the root queue.
Further queues can be set up by configuring
<<<yarn.scheduler.capacity.root.queues>>> with a list of comma-separated
child queues.
The configuration for <<<CapacityScheduler>>> uses a concept called
<queue path> to configure the hierarchy of queues. The <queue path> is the
full path of the queue's hierarchy, starting at <root>, with . (dot) as the
delimiter.
A given queue's children can be defined with the configuration knob:
<<<yarn.scheduler.capacity.<queue-path>.queues>>>. Children do not
inherit properties directly from the parent unless otherwise noted.
Here is an example with three top-level child-queues <<<a>>>, <<<b>>> and
<<<c>>> and some sub-queues for <<<a>>> and <<<b>>>:
----
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>a,b,c</value>
<description>The queues at this level (root is the root queue).
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.a.queues</name>
<value>a1,a2</value>
<description>The queues at this level (root is the root queue).
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.b.queues</name>
<value>b1,b2,b3</value>
<description>The queues at this level (root is the root queue).
</description>
</property>
----
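As an illustration only, the hierarchy defined by the properties above can be expanded into full queue paths with a few lines of Python. This is a sketch, not part of Hadoop: the function <<<queue_paths>>> and the plain dictionary used to hold the properties are hypothetical names introduced here.

```python
# Hypothetical helper (not a Hadoop API): expands "<queue-path>.queues"
# properties into the full list of queue paths, starting at root.
PREFIX = "yarn.scheduler.capacity."


def queue_paths(props, path="root"):
    """Return every queue path reachable from `path`, depth first."""
    paths = [path]
    children = props.get(PREFIX + path + ".queues", "")
    for child in (c.strip() for c in children.split(",")):
        if child:
            paths.extend(queue_paths(props, path + "." + child))
    return paths


# The same three properties as in the XML example above.
props = {
    "yarn.scheduler.capacity.root.queues": "a,b,c",
    "yarn.scheduler.capacity.root.a.queues": "a1,a2",
    "yarn.scheduler.capacity.root.b.queues": "b1,b2,b3",
}
all_paths = queue_paths(props)
print(all_paths)
```

Each entry in the result, such as root.a.a1, is a queue path that can be substituted for <queue-path> in the per-queue properties.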
* Queue Properties
* Resource Allocation
*--------------------------------------+--------------------------------------+
|| Property || Description |
*--------------------------------------+--------------------------------------+
| <<<yarn.scheduler.capacity.<queue-path>.capacity>>> | |
| | Queue <capacity> in percentage (%) as a float (e.g. 12.5).|
| | The sum of capacities for all queues, at each level, must be equal |
| | to 100. |
| | Applications in the queue may consume more resources than the queue's |
| | capacity if there are free resources, providing elasticity. |
*--------------------------------------+--------------------------------------+
| <<<yarn.scheduler.capacity.<queue-path>.maximum-capacity>>> | |
| | Maximum queue capacity in percentage (%) as a float. |
| | This limits the <elasticity> for applications in the queue. |
| | Defaults to -1 which disables it. |
*--------------------------------------+--------------------------------------+
| <<<yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent>>> | |
| | Each queue enforces a limit on the percentage of resources allocated to a |
| | user at any given time, if there is demand for resources. The user limit |
| | can vary between a minimum and maximum value. The former |
| | (the minimum value) is set to this property value and the latter |
| | (the maximum value) depends on the number of users who have submitted |
| | applications. For example, suppose the value of this property is 25. |
| | If two users have submitted applications to a queue, no single user can |
| | use more than 50% of the queue resources. If a third user submits an |
| | application, no single user can use more than 33% of the queue resources. |
| | With 4 or more users, no user can use more than 25% of the queue's |
| | resources. A value of 100 implies no user limits are imposed. The default |
| | is 100. Value is specified as an integer.|
*--------------------------------------+--------------------------------------+
| <<<yarn.scheduler.capacity.<queue-path>.user-limit-factor>>> | |
| | The multiple of the queue capacity which can be configured to allow a |
| | single user to acquire more resources. By default this is set to 1 which |
| | ensures that a single user can never take more than the queue's configured |
| | capacity irrespective of how idle the cluster is. Value is specified as |
| | a float.|
*--------------------------------------+--------------------------------------+
| <<<yarn.scheduler.capacity.<queue-path>.maximum-allocation-mb>>> | |
| | The per queue maximum limit of memory to allocate to each container |
| | request at the Resource Manager. This setting overrides the cluster |
| | configuration <<<yarn.scheduler.maximum-allocation-mb>>>. This value |
| | must be smaller than or equal to the cluster maximum. |
*--------------------------------------+--------------------------------------+
| <<<yarn.scheduler.capacity.<queue-path>.maximum-allocation-vcores>>> | |
| | The per queue maximum limit of virtual cores to allocate to each container |
| | request at the Resource Manager. This setting overrides the cluster |
| | configuration <<<yarn.scheduler.maximum-allocation-vcores>>>. This value |
| | must be smaller than or equal to the cluster maximum. |
*--------------------------------------+--------------------------------------+
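The <<<minimum-user-limit-percent>>> example in the table above can be reproduced with a small calculation. The sketch below only mirrors the arithmetic of that description; the function name is hypothetical, and it deliberately ignores <<<user-limit-factor>>> and actual resource demand, both of which the real scheduler takes into account.

```python
def max_user_share_percent(active_users, minimum_user_limit_percent=100):
    """Upper bound on one user's share of the queue, in percent.

    Simplified model of the description above: users split the queue
    equally, but the per-user limit never drops below the configured minimum.
    """
    if active_users < 1:
        raise ValueError("active_users must be at least 1")
    equal_share = 100.0 / active_users
    return max(equal_share, float(minimum_user_limit_percent))


# With minimum-user-limit-percent set to 25, as in the description above.
for users in (2, 3, 4, 10):
    print(users, round(max_user_share_percent(users, 25), 1))
```

With the default value of 100 the function returns 100 for any number of users, which matches the statement that no user limits are imposed.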
* Running and Pending Application Limits
The <<<CapacityScheduler>>> supports the following parameters to control
the running and pending applications:
*--------------------------------------+--------------------------------------+
|| Property || Description |
*--------------------------------------+--------------------------------------+
| <<<yarn.scheduler.capacity.maximum-applications>>> / |
| <<<yarn.scheduler.capacity.<queue-path>.maximum-applications>>> | |
| | Maximum number of applications in the system which can be concurrently |
| | active both running and pending. Limits on each queue are directly |
| | proportional to their queue capacities and user limits. This is a |
| | hard limit and any applications submitted when this limit is reached will |
| | be rejected. Default is 10000. This can be set for all queues with |
| | <<<yarn.scheduler.capacity.maximum-applications>>> and can also be overridden on a |
| | per queue basis by setting <<<yarn.scheduler.capacity.<queue-path>.maximum-applications>>>. |
| | Integer value expected.|
*--------------------------------------+--------------------------------------+
| <<<yarn.scheduler.capacity.maximum-am-resource-percent>>> / |
| <<<yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent>>> | |
| | Maximum percent of resources in the cluster which can be used to run |
| | application masters - controls number of concurrent active applications. Limits on each |
| | queue are directly proportional to their queue capacities and user limits. |
| | Specified as a float - i.e. 0.5 = 50%. Default is 10%. This can be set for all queues with |
| | <<<yarn.scheduler.capacity.maximum-am-resource-percent>>> and can also be overridden on a |
| | per queue basis by setting <<<yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent>>> |
*--------------------------------------+--------------------------------------+
* Queue Administration & Permissions
The <<<CapacityScheduler>>> supports the following parameters to
administer the queues:
*--------------------------------------+--------------------------------------+
|| Property || Description |
*--------------------------------------+--------------------------------------+
| <<<yarn.scheduler.capacity.<queue-path>.state>>> | |
| | The <state> of the queue. Can be one of <<<RUNNING>>> or <<<STOPPED>>>. |
| | If a queue is in <<<STOPPED>>> state, new applications cannot be |
| | submitted to <itself> or <any of its child queues>. |
| | Thus, if the <root> queue is <<<STOPPED>>> no applications can be |
| | submitted to the entire cluster. |
| | Existing applications continue to completion, thus the queue can be |
| | <drained> gracefully. Value is specified as Enumeration. |
*--------------------------------------+--------------------------------------+
| <<<yarn.scheduler.capacity.root.<queue-path>.acl_submit_applications>>> | |
| | The <ACL> which controls who can <submit> applications to the given queue. |
| | If the given user/group has necessary ACLs on the given queue or |
| | <one of the parent queues in the hierarchy> they can submit applications. |
| | <ACLs> for this property <are> inherited from the parent queue |
| | if not specified. |
*--------------------------------------+--------------------------------------+
| <<<yarn.scheduler.capacity.root.<queue-path>.acl_administer_queue>>> | |
| | The <ACL> which controls who can <administer> applications on the given queue. |
| | If the given user/group has necessary ACLs on the given queue or |
| | <one of the parent queues in the hierarchy> they can administer applications. |
| | <ACLs> for this property <are> inherited from the parent queue |
| | if not specified. |
*--------------------------------------+--------------------------------------+
<Note:> An <ACL> is of the form <user1>, <user2><space><group1>, <group2>.
The special value of <<*>> implies <anyone>. The special value of <space>
implies <no one>. The default is <<*>> for the root queue if not specified.
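To make the ACL format concrete, here is a minimal Python sketch of how such a string could be interpreted. It is not Hadoop's own ACL implementation; <<<parse_acl>>> and <<<is_allowed>>> are hypothetical helpers, and the sketch assumes users and groups are comma-separated with a single space between the two lists. Inheritance from parent queues is not modelled.

```python
def parse_acl(acl):
    """Return (allow_all, users, groups) for an ACL string."""
    if acl.strip() == "*":
        return True, set(), set()
    # Users come before the first space, groups after it.
    user_part, _, group_part = acl.partition(" ")
    users = {u.strip() for u in user_part.split(",") if u.strip()}
    groups = {g.strip() for g in group_part.split(",") if g.strip()}
    return False, users, groups


def is_allowed(acl, user, user_groups=()):
    """True if `user` (or one of the user's groups) is named by the ACL."""
    allow_all, users, groups = parse_acl(acl)
    return allow_all or user in users or bool(groups & set(user_groups))


print(parse_acl("alice,bob ops,dev"))
print(is_allowed("*", "carol"))        # special value: anyone
print(is_allowed(" ", "carol"))        # special value: no one
```

A groups-only ACL starts with the space, for example " ops", which yields an empty user set and the single group ops.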
* Other Properties
* Resource Calculator
*--------------------------------------+--------------------------------------+
|| Property || Description |
*--------------------------------------+--------------------------------------+
| <<<yarn.scheduler.capacity.resource-calculator>>> | |
| | The ResourceCalculator implementation to be used to compare Resources in the |
| | scheduler. The default i.e. org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator |
| | only uses Memory while DominantResourceCalculator uses Dominant-resource |
| | to compare multi-dimensional resources such as Memory, CPU etc. A Java |
| | ResourceCalculator class name is expected. |
*--------------------------------------+--------------------------------------+
* Data Locality
*--------------------------------------+--------------------------------------+
|| Property || Description |
*--------------------------------------+--------------------------------------+
| <<<yarn.scheduler.capacity.node-locality-delay>>> | |
| | Number of missed scheduling opportunities after which the CapacityScheduler |
| | attempts to schedule rack-local containers. Typically, this should be set to |
| | number of nodes in the cluster. The default is approximately the number |
| | of nodes in one rack, which is 40. Positive integer value is expected.|
*--------------------------------------+--------------------------------------+
* Reviewing the configuration of the CapacityScheduler
Once installation and configuration are complete, you can review the
configuration from the web UI after starting the YARN cluster.
* Start the YARN cluster in the normal manner.
* Open the <<<ResourceManager>>> web UI.
* The </scheduler> web-page should show the resource usages of individual
queues.
[]
* {Changing Queue Configuration}
Changing queue properties and adding new queues is very simple. You need to
edit <<conf/capacity-scheduler.xml>> and run <yarn rmadmin -refreshQueues>.
----
$ vi $HADOOP_CONF_DIR/capacity-scheduler.xml
$ $HADOOP_YARN_HOME/bin/yarn rmadmin -refreshQueues
----
<Note:> Queues cannot be <deleted>; only addition of new queues is supported.
The updated queue configuration must be valid, i.e. the queue capacities at
each <level> must sum to 100%.
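Before running the refresh, the capacity rule can be checked offline. The following Python sketch walks a dictionary of properties and reports every parent queue whose children do not add up to 100. It is an illustration under stated assumptions: <<<capacity_errors>>> is a hypothetical helper, the properties are assumed to be already loaded into a dictionary, and a missing capacity is treated as 0.

```python
PREFIX = "yarn.scheduler.capacity."


def capacity_errors(props, path="root", tolerance=1e-6):
    """Return a message for every parent queue whose children do not sum to 100."""
    raw = props.get(PREFIX + path + ".queues", "")
    children = [c.strip() for c in raw.split(",") if c.strip()]
    if not children:
        return []  # leaf queue: nothing to check below it
    total = sum(
        float(props.get(PREFIX + path + "." + c + ".capacity", 0))
        for c in children
    )
    errors = []
    if abs(total - 100.0) > tolerance:
        errors.append("children of %s sum to %s, expected 100" % (path, total))
    for c in children:
        errors.extend(capacity_errors(props, path + "." + c, tolerance))
    return errors


# Example values are made up for illustration.
props = {
    PREFIX + "root.queues": "a,b",
    PREFIX + "root.a.capacity": "60",
    PREFIX + "root.b.capacity": "40",
    PREFIX + "root.a.queues": "a1,a2",
    PREFIX + "root.a.a1.capacity": "12.5",
    PREFIX + "root.a.a2.capacity": "80",
}
errors = capacity_errors(props)
print(errors)
```

In this made-up configuration the top level is valid (60 + 40), while the children of root.a are reported because 12.5 + 80 is not 100.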

@@ -1,204 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Map Reduce Next Generation-${project.version} - Docker Container Executor
---
---
${maven.build.timestamp}
Docker Container Executor
%{toc|section=1|fromDepth=0}
* {Overview}
Docker (https://www.docker.io/) combines an easy-to-use interface to
Linux containers with easy-to-construct image files for those
containers. In short, Docker launches very lightweight virtual
machines.
The Docker Container Executor (DCE) allows the YARN NodeManager to
launch YARN containers into Docker containers. Users can specify the
Docker images they want for their YARN containers. These containers
provide a custom software environment in which the user's code runs,
isolated from the software environment of the NodeManager. These
containers can include special libraries needed by the application,
and they can have different versions of Perl, Python, and even Java
than what is installed on the NodeManager. Indeed, these containers
can run a different flavor of Linux than what is running on the
NodeManager -- although the YARN container must define the entire environment
and all libraries needed to run the job, since nothing will be shared with the NodeManager.
Docker for YARN provides both consistency (all YARN containers will
have the same software environment) and isolation (no interference
with whatever is installed on the physical machine).
* {Cluster Configuration}
Docker Container Executor runs in non-secure mode of HDFS and
YARN. It will not run in secure mode, and will exit if it detects
secure mode.
The DockerContainerExecutor requires the Docker daemon to be running on
the NodeManagers, and the Docker client installed and able to start Docker
containers. To prevent timeouts while starting jobs, the Docker
images to be used by a job should already be downloaded in the
NodeManagers. Here's an example of how this can be done:
----
sudo docker pull sequenceiq/hadoop-docker:2.4.1
----
This should be done as part of the NodeManager startup.
The following properties must be set in yarn-site.xml:
----
<property>
<name>yarn.nodemanager.docker-container-executor.exec-name</name>
<value>/usr/bin/docker</value>
<description>
Name or path to the Docker client. This is a required parameter. If this is empty,
the user must pass an image name as part of the job invocation (see below).
</description>
</property>
<property>
<name>yarn.nodemanager.container-executor.class</name>
<value>org.apache.hadoop.yarn.server.nodemanager.DockerContainerExecutor</value>
<description>
This is the container executor setting that ensures that all
jobs are started with the DockerContainerExecutor.
</description>
</property>
----
Administrators should be aware that DCE doesn't currently provide
user name-space isolation. This means, in particular, that software
running as root in the YARN container will have root privileges in the
underlying NodeManager. Put differently, DCE currently provides no
better security guarantees than YARN's Default Container Executor. In
fact, DockerContainerExecutor will exit if it detects secure YARN.
* {Tips for connecting to a secure docker repository}
By default, docker images are pulled from the docker public repository. The
format of a docker image url is: <username>/<image_name>. For example,
sequenceiq/hadoop-docker:2.4.1 is an image in the docker public repository that contains Java and
Hadoop.
If you want your own private repository, you provide the repository url instead of
your username. Therefore, the image url becomes: <private_repo_url>/<image_name>.
For example, if your repository is on localhost:8080, your images would be like:
localhost:8080/hadoop-docker
To connect to a secure docker repository, you can use the following invocation:
----
docker login [OPTIONS] [SERVER]
Register or log in to a Docker registry server, if no server is specified
"https://index.docker.io/v1/" is the default.
-e, --email="" Email
-p, --password="" Password
-u, --username="" Username
----
If you want to log in to a self-hosted registry you can specify this by adding
the server name.
----
docker login <private_repo_url>
----
This needs to be run as part of the NodeManager startup, or as a cron job if
the login session expires periodically. You can log in to multiple docker repositories
from the same NodeManager, but all your users will have access to all your repositories,
as at present the DockerContainerExecutor does not support per-job docker login.
* {Job Configuration}
Currently you cannot configure any of the Docker settings with the job configuration.
You can provide Mapper, Reducer, and ApplicationMaster environment overrides for the
docker images, using the following 3 JVM properties respectively (only for MR jobs):
* mapreduce.map.env: You can override the mapper's image by passing
yarn.nodemanager.docker-container-executor.image-name=<your_image_name>
to this JVM property.
* mapreduce.reduce.env: You can override the reducer's image by passing
yarn.nodemanager.docker-container-executor.image-name=<your_image_name>
to this JVM property.
* yarn.app.mapreduce.am.env: You can override the ApplicationMaster's image
by passing yarn.nodemanager.docker-container-executor.image-name=<your_image_name>
to this JVM property.
* {Docker Image requirements}
The Docker Images used for YARN containers must meet the following
requirements:
The distro and version of Linux in your Docker Image can be quite different
from that of your NodeManager. (Docker does have a few limitations in this
regard, but you're not likely to hit them.) However, if you're using the
MapReduce framework, then your image will need to be configured for running
Hadoop. Java must be installed in the container, and the following environment variables
must be defined in the image: JAVA_HOME, HADOOP_COMMON_PATH, HADOOP_HDFS_HOME,
HADOOP_MAPRED_HOME, HADOOP_YARN_HOME, and HADOOP_CONF_DIR.
* {Working example of yarn launched docker containers.}
The following example shows how to run teragen using DockerContainerExecutor.
* First ensure that YARN is properly configured with DockerContainerExecutor (see above).
----
<property>
<name>yarn.nodemanager.docker-container-executor.exec-name</name>
<value>docker -H=tcp://0.0.0.0:4243</value>
<description>
Name or path to the Docker client. The tcp socket must be
where docker daemon is listening.
</description>
</property>
<property>
<name>yarn.nodemanager.container-executor.class</name>
<value>org.apache.hadoop.yarn.server.nodemanager.DockerContainerExecutor</value>
<description>
This is the container executor setting that ensures that all
jobs are started with the DockerContainerExecutor.
</description>
</property>
----
* Pick a custom Docker image if you want. In this example, we'll use sequenceiq/hadoop-docker:2.4.1 from the
docker hub repository. It has jdk, hadoop, and all the previously mentioned environment variables configured.
* Run:
----
hadoop jar $HADOOP_INSTALLATION_DIR/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
teragen \
-Dmapreduce.map.env="yarn.nodemanager.docker-container-executor.image-name=sequenceiq/hadoop-docker:2.4.1" \
-Dyarn.app.mapreduce.am.env="yarn.nodemanager.docker-container-executor.image-name=sequenceiq/hadoop-docker:2.4.1" \
1000 \
teragen_out_dir
----
Once it succeeds, you can check the yarn debug logs to verify that docker indeed has launched containers.

@@ -1,483 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Map Reduce Next Generation-${project.version} - Fair Scheduler
---
---
${maven.build.timestamp}
Hadoop MapReduce Next Generation - Fair Scheduler
%{toc|section=1|fromDepth=0}
* {Purpose}
This document describes the <<<FairScheduler>>>, a pluggable scheduler for Hadoop
that allows YARN applications to share resources in large clusters fairly.
* {Introduction}
Fair scheduling is a method of assigning resources to applications such that
all apps get, on average, an equal share of resources over time.
Hadoop NextGen is capable of scheduling multiple resource types. By default,
the Fair Scheduler bases scheduling fairness decisions only on memory. It
can be configured to schedule with both memory and CPU, using the notion
of Dominant Resource Fairness developed by Ghodsi et al. When there is a
single app running, that app uses the entire cluster. When other apps are
submitted, resources that free up are assigned to the new apps, so that each
app eventually gets roughly the same amount of resources. Unlike the default
Hadoop scheduler, which forms a queue of apps, this lets short apps finish in
reasonable time while not starving long-lived apps. It is also a reasonable way
to share a cluster between a number of users. Finally, fair sharing can also
work with app priorities - the priorities are used as weights to determine the
fraction of total resources that each app should get.
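The role of weights can be illustrated with a short calculation. The Python sketch below only shows the idealized proportional split implied by the sentence above; it is not the scheduler's algorithm, which also accounts for demand, minimum shares and the queue hierarchy. The function name and the example weights are hypothetical.

```python
def weighted_fractions(weights):
    """Map each app to its idealized fraction of resources, proportional to weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {app: weight / total for app, weight in weights.items()}


# Two apps with made-up weights 1 and 3.
shares = weighted_fractions({"app_a": 1, "app_b": 3})
print(shares)
```

With equal weights every app receives the same fraction, which is the plain fair-sharing case described at the start of this section.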
The scheduler organizes apps further into "queues", and shares resources
fairly between these queues. By default, all users share a single queue,
named "default". If an app specifically lists a queue in a container resource
request, the request is submitted to that queue. It is also possible to assign
queues based on the user name included with the request through
configuration. Within each queue, a scheduling policy is used to share
resources between the running apps. The default is memory-based fair sharing,
but FIFO and multi-resource with Dominant Resource Fairness can also be
configured. Queues can be arranged in a hierarchy to divide resources and
configured with weights to share the cluster in specific proportions.
In addition to providing fair sharing, the Fair Scheduler allows assigning
guaranteed minimum shares to queues, which is useful for ensuring that
certain users, groups or production applications always get sufficient
resources. When a queue contains apps, it gets at least its minimum share,
but when the queue does not need its full guaranteed share, the excess is
split between other running apps. This lets the scheduler guarantee capacity
for queues while utilizing resources efficiently when these queues don't
contain applications.
The Fair Scheduler lets all apps run by default, but it is also possible to
limit the number of running apps per user and per queue through the config
file. This can be useful when a user must submit hundreds of apps at once,
or in general to improve performance if running too many apps at once would
cause too much intermediate data to be created or too much context-switching.
Limiting the apps does not cause any subsequently submitted apps to fail,
only to wait in the scheduler's queue until some of the user's earlier apps
finish.
* {Hierarchical queues with pluggable policies}
The fair scheduler supports hierarchical queues. All queues descend from a
queue named "root". Available resources are distributed among the children
of the root queue in the typical fair scheduling fashion. Then, the children
distribute the resources assigned to them to their children in the same
fashion. Applications may only be scheduled on leaf queues. Queues can be
specified as children of other queues by placing them as sub-elements of
their parents in the fair scheduler allocation file.
A queue's name starts with the names of its parents, with periods as
separators. So a queue named "queue1" under the root queue, would be referred
to as "root.queue1", and a queue named "queue2" under a queue named "parent1"
would be referred to as "root.parent1.queue2". When referring to queues, the
root part of the name is optional, so queue1 could be referred to as just
"queue1", and queue2 could be referred to as just "parent1.queue2".
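The naming rule can be expressed as a small normalization step. The Python sketch below is illustrative only; <<<full_queue_name>>> is a hypothetical helper, not part of the Fair Scheduler.

```python
def full_queue_name(name):
    """Return the queue name with the optional leading "root." restored."""
    if name == "root" or name.startswith("root."):
        return name
    return "root." + name


for short in ("queue1", "parent1.queue2", "root.queue1"):
    print(short, "->", full_queue_name(short))
```

Both the short and the fully qualified form map to the same name, so either can be used when referring to a queue.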
Additionally, the fair scheduler allows setting a different custom policy for
each queue to allow sharing the queue's resources in any which way the user
wants. A custom policy can be built by extending
<<<org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.SchedulingPolicy>>>.
FifoPolicy, FairSharePolicy (default), and DominantResourceFairnessPolicy are
built-in and can be readily used.
Certain add-ons are not yet supported which existed in the original (MR1)
Fair Scheduler. Among them is the use of custom policies governing
priority "boosting" over certain apps.
* {Automatically placing applications in queues}
The Fair Scheduler allows administrators to configure policies that
automatically place submitted applications into appropriate queues. Placement
can depend on the user and groups of the submitter and the requested queue
passed by the application. A policy consists of a set of rules that are applied
sequentially to classify an incoming application. Each rule either places the
app into a queue, rejects it, or continues on to the next rule. Refer to the
allocation file format below for how to configure these policies.
* {Installation}
To use the Fair Scheduler, first assign the appropriate scheduler class in
yarn-site.xml:
------
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
------
* {Configuration}
Customizing the Fair Scheduler typically involves altering two files. First,
scheduler-wide options can be set by adding configuration properties in the
yarn-site.xml file in your existing configuration directory. Second, in
most cases users will want to create an allocation file listing which queues
exist and their respective weights and capacities. The allocation file
is reloaded every 10 seconds, allowing changes to be made on the fly.
Properties that can be placed in yarn-site.xml
* <<<yarn.scheduler.fair.allocation.file>>>
* Path to allocation file. An allocation file is an XML manifest describing
queues and their properties, in addition to certain policy defaults. This file
must be in the XML format described in the next section. If a relative path is
given, the file is searched for on the classpath (which typically includes
the Hadoop conf directory).
Defaults to fair-scheduler.xml.
* <<<yarn.scheduler.fair.user-as-default-queue>>>
* Whether to use the username associated with the allocation as the default
queue name, in the event that a queue name is not specified. If this is set
to "false" or unset, all jobs have a shared default queue, named "default".
Defaults to true. If a queue placement policy is given in the allocations
file, this property is ignored.
* <<<yarn.scheduler.fair.preemption>>>
* Whether to use preemption. Defaults to false.
* <<<yarn.scheduler.fair.preemption.cluster-utilization-threshold>>>
* The utilization threshold after which preemption kicks in. The
utilization is computed as the maximum ratio of usage to capacity among
all resources. Defaults to 0.8f.
* <<<yarn.scheduler.fair.sizebasedweight>>>
* Whether to assign shares to individual apps based on their size, rather than
providing an equal share to all apps regardless of size. When set to true,
apps are weighted by the natural logarithm of one plus the app's total
requested memory, divided by the natural logarithm of 2. Defaults to false.
* <<<yarn.scheduler.fair.assignmultiple>>>
* Whether to allow multiple container assignments in one heartbeat. Defaults
to false.
* <<<yarn.scheduler.fair.max.assign>>>
* If assignmultiple is true, the maximum number of containers that can be
assigned in one heartbeat. Defaults to -1, which sets no limit.
* <<<yarn.scheduler.fair.locality.threshold.node>>>
* For applications that request containers on particular nodes, the number of
scheduling opportunities since the last container assignment to wait before
accepting a placement on another node. Expressed as a float between 0 and 1,
which, as a fraction of the cluster size, is the number of scheduling
opportunities to pass up. The default value of -1.0 means don't pass up any
scheduling opportunities.
* <<<yarn.scheduler.fair.locality.threshold.rack>>>
* For applications that request containers on particular racks, the number of
scheduling opportunities since the last container assignment to wait before
accepting a placement on another rack. Expressed as a float between 0 and 1,
which, as a fraction of the cluster size, is the number of scheduling
opportunities to pass up. The default value of -1.0 means don't pass up any
scheduling opportunities.
* <<<yarn.scheduler.fair.allow-undeclared-pools>>>
* If this is true, new queues can be created at application submission time,
whether because they are specified as the application's queue by the
submitter or because they are placed there by the user-as-default-queue
property. If this is false, any time an app would be placed in a queue that
is not specified in the allocations file, it is placed in the "default" queue
instead. Defaults to true. If a queue placement policy is given in the
allocations file, this property is ignored.
* <<<yarn.scheduler.fair.update-interval-ms>>>
* The interval at which to lock the scheduler and recalculate fair shares,
recalculate demand, and check whether anything is due for preemption.
Defaults to 500 ms.
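Two of the properties above describe small calculations, which the following sketch spells out. It is an illustration only, not the scheduler's code: the function names are hypothetical, the unit of the requested memory is not stated in this document, and truncating the locality threshold product to an integer is an assumption.

```python
import math


def size_based_weight(requested_memory: float) -> float:
    """Weight used when yarn.scheduler.fair.sizebasedweight is true:
    ln(1 + total requested memory) / ln(2)."""
    return math.log1p(requested_memory) / math.log(2)


def opportunities_to_pass_up(threshold: float, cluster_nodes: int) -> int:
    """Scheduling opportunities to skip before relaxing locality.

    The threshold is a fraction of the cluster size; a negative value
    (the default of -1.0) means no opportunities are passed up.
    """
    if threshold < 0:
        return 0
    # Assumption: the fractional result is truncated to a whole number.
    return int(threshold * cluster_nodes)
```

With these definitions, an app requesting 1023 units of memory gets a weight of about 10, and a node locality threshold of 0.5 on a 100-node cluster corresponds to 50 skipped opportunities.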
Allocation file format
The allocation file must be in XML format. The format contains the following
types of elements:
* <<Queue elements>>, which represent queues. Queue elements can take an optional
attribute 'type', which when set to 'parent' makes it a parent queue. This is useful
when we want to create a parent queue without configuring any leaf queues.
Each queue element may contain the following properties:
* minResources: minimum resources the queue is entitled to, in the form
"X mb, Y vcores". For the single-resource fairness policy, the vcores
value is ignored. If a queue's minimum share is not satisfied, it will be
offered available resources before any other queue under the same parent.
Under the single-resource fairness policy, a queue
is considered unsatisfied if its memory usage is below its minimum memory
share. Under dominant resource fairness, a queue is considered unsatisfied
if its usage for its dominant resource with respect to the cluster capacity
is below its minimum share for that resource. If multiple queues are
unsatisfied in this situation, resources go to the queue with the smallest
ratio between relevant resource usage and minimum. Note that it is
possible that a queue that is below its minimum may not immediately get up
to its minimum when it submits an application, because already-running jobs
may be using those resources.
* maxResources: maximum resources a queue is allowed, in the form
"X mb, Y vcores". For the single-resource fairness policy, the vcores
value is ignored. A queue will never be assigned a container that would
put its aggregate usage over this limit.
* maxRunningApps: limit the number of apps from the queue to run at once
* maxAMShare: limit the fraction of the queue's fair share that can be used
to run application masters. This property can only be used for leaf queues.
For example, if set to 1.0f, then AMs in the leaf queue can take up to 100%
of both the memory and CPU fair share. The value of -1.0f will disable
this feature and the amShare will not be checked. The default value is 0.5f.
* weight: to share the cluster non-proportionally with other queues. Weights
default to 1, and a queue with weight 2 should receive approximately twice
as many resources as a queue with the default weight.
* schedulingPolicy: to set the scheduling policy of any queue. The allowed
values are "fifo"/"fair"/"drf" or any class that extends
<<<org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.SchedulingPolicy>>>.
Defaults to "fair". If "fifo", apps with earlier submit times are given preference
for containers, but apps submitted later may run concurrently if there is
leftover space on the cluster after satisfying the earlier app's requests.
* aclSubmitApps: a list of users and/or groups that can submit apps to the
queue. Refer to the ACLs section below for more info on the format of this
list and how queue ACLs work.
* aclAdministerApps: a list of users and/or groups that can administer a
queue. Currently the only administrative action is killing an application.
Refer to the ACLs section below for more info on the format of this list
and how queue ACLs work.
* minSharePreemptionTimeout: number of seconds the queue is under its minimum share
before it will try to preempt containers to take resources from other queues.
If not set, the queue will inherit the value from its parent queue.
* fairSharePreemptionTimeout: number of seconds the queue is under its fair share
threshold before it will try to preempt containers to take resources from other
queues. If not set, the queue will inherit the value from its parent queue.
* fairSharePreemptionThreshold: the fair share preemption threshold for the
queue. If the queue waits fairSharePreemptionTimeout without receiving
fairSharePreemptionThreshold*fairShare resources, it is allowed to preempt
containers to take resources from other queues. If not set, the queue will
inherit the value from its parent queue.
* <<User elements>>, which represent settings governing the behavior of individual
users. They can contain a single property: maxRunningApps, a limit on the
number of running apps for a particular user.
* <<A userMaxAppsDefault element>>, which sets the default running app limit
for any users whose limit is not otherwise specified.
* <<A defaultFairSharePreemptionTimeout element>>, which sets the fair share
preemption timeout for the root queue; overridden by fairSharePreemptionTimeout
element in root queue.
* <<A defaultMinSharePreemptionTimeout element>>, which sets the min share
preemption timeout for the root queue; overridden by minSharePreemptionTimeout
element in root queue.
* <<A defaultFairSharePreemptionThreshold element>>, which sets the fair share
preemption threshold for the root queue; overridden by fairSharePreemptionThreshold
element in root queue.
* <<A queueMaxAppsDefault element>>, which sets the default running app limit
for queues; overridden by maxRunningApps element in each queue.
* <<A queueMaxAMShareDefault element>>, which sets the default AM resource
limit for queues; overridden by maxAMShare element in each queue.
* <<A defaultQueueSchedulingPolicy element>>, which sets the default scheduling
policy for queues; overridden by the schedulingPolicy element in each queue
if specified. Defaults to "fair".
* <<A queuePlacementPolicy element>>, which contains a list of rule elements
that tell the scheduler how to place incoming apps into queues. Rules
are applied in the order that they are listed. Rules may take arguments. All
rules accept the "create" argument, which indicates whether the rule can create
a new queue. "Create" defaults to true; if set to false and the rule would
place the app in a queue that is not configured in the allocations file, we
continue on to the next rule. The last rule must be one that can never issue a
continue. Valid rules are:
* specified: the app is placed into the queue it requested. If the app
requested no queue, i.e. it specified "default", we continue. If the app
requested a queue name starting or ending with a period, e.g. ".q1" or
"q1.", it will be rejected.
* user: the app is placed into a queue with the name of the user who
submitted it. Periods in the username will be replaced with "_dot_",
e.g. the queue name for user "first.last" is "first_dot_last".
* primaryGroup: the app is placed into a queue with the name of the
primary group of the user who submitted it. Periods in the group name
will be replaced with "_dot_", i.e. the queue name for group "one.two"
is "one_dot_two".
* secondaryGroupExistingQueue: the app is placed into a queue with a name
that matches a secondary group of the user who submitted it. The first
secondary group that matches a configured queue will be selected.
Periods in group names will be replaced with "_dot_", i.e. a user with
"one.two" as one of their secondary groups would be placed into the
"one_dot_two" queue, if such a queue exists.
* nestedUserQueue: the app is placed into a queue with the name of the user
under the queue suggested by the nested rule. This is similar to the 'user'
rule, the difference being that with the 'nestedUserQueue' rule, user queues can be created
under any parent queue, while the 'user' rule creates user queues only under the root queue.
Note that the nestedUserQueue rule is applied only if the nested rule returns a
parent queue. One can configure a parent queue either by setting the 'type' attribute of a queue
to 'parent' or by configuring at least one leaf under that queue, which makes it a parent.
See the example allocation file for a sample use case.
* default: the app is placed into the queue specified in the 'queue' attribute of the
default rule. If 'queue' attribute is not specified, the app is placed into 'root.default' queue.
* reject: the app is rejected.
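The way placement rules are tried in order can be sketched as follows. This is a simplified model and not the scheduler's implementation: the <<<place_app>>> function and its tuple representation of rules are hypothetical, only the specified, user, primaryGroup, default and reject rules are modelled, and the default rule is assumed to always place the app.

```python
from typing import Optional, Sequence, Set, Tuple

# (rule name, "create" flag, optional 'queue' attribute)
Rule = Tuple[str, bool, Optional[str]]


def place_app(rules: Sequence[Rule], user: str, primary_group: str,
              requested_queue: str, configured: Set[str]) -> Optional[str]:
    """Return the chosen queue name, or None if the app is rejected."""
    for name, create, queue_attr in rules:
        if name == "reject":
            return None
        if name == "default":
            # Assumption: the default rule always places the app.
            return queue_attr or "root.default"
        if name == "specified":
            if requested_queue == "default":
                continue  # no queue was requested, try the next rule
            if requested_queue.startswith(".") or requested_queue.endswith("."):
                return None  # names like ".q1" or "q1." are rejected
            candidate = requested_queue
        elif name == "user":
            candidate = user.replace(".", "_dot_")
        elif name == "primaryGroup":
            candidate = primary_group.replace(".", "_dot_")
        else:
            raise ValueError("rule not modelled in this sketch: " + name)
        if create or candidate in configured:
            return candidate
        # create is false and the queue is not configured: continue
    raise ValueError("the last rule must never issue a continue")
```

In this model, a policy of specified, then primaryGroup with create set to false, then default places an app that requested no queue into the default rule's queue whenever the user's primary group has no configured queue.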
An example allocation file is given here:
---
<?xml version="1.0"?>
<allocations>
<queue name="sample_queue">
<minResources>10000 mb,0vcores</minResources>
<maxResources>90000 mb,0vcores</maxResources>
<maxRunningApps>50</maxRunningApps>
<maxAMShare>0.1</maxAMShare>
<weight>2.0</weight>
<schedulingPolicy>fair</schedulingPolicy>
<queue name="sample_sub_queue">
<aclSubmitApps>charlie</aclSubmitApps>
<minResources>5000 mb,0vcores</minResources>
</queue>
</queue>
<queueMaxAMShareDefault>0.5</queueMaxAMShareDefault>
<!-- Queue 'secondary_group_queue' is a parent queue and may have
user queues under it -->
<queue name="secondary_group_queue" type="parent">
<weight>3.0</weight>
</queue>
<user name="sample_user">
<maxRunningApps>30</maxRunningApps>
</user>
<userMaxAppsDefault>5</userMaxAppsDefault>
<queuePlacementPolicy>
<rule name="specified" />
<rule name="primaryGroup" create="false" />
<rule name="nestedUserQueue">
<rule name="secondaryGroupExistingQueue" create="false" />
</rule>
<rule name="default" queue="sample_queue"/>
</queuePlacementPolicy>
</allocations>
---
Note that for backwards compatibility with the original FairScheduler, "queue" elements can instead be named as "pool" elements.
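To show how the nesting of queue elements maps to queue names, here is a minimal sketch that reads a trimmed copy of the example above with Python's standard XML parser. The helper names are hypothetical, the resource pattern is inferred only from the "X mb, Y vcores" form described earlier, and "pool" elements are not handled.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0"?>
<allocations>
  <queue name="sample_queue">
    <minResources>10000 mb,0vcores</minResources>
    <maxResources>90000 mb,0vcores</maxResources>
    <weight>2.0</weight>
    <queue name="sample_sub_queue">
      <minResources>5000 mb,0vcores</minResources>
    </queue>
  </queue>
  <queue name="secondary_group_queue" type="parent">
    <weight>3.0</weight>
  </queue>
</allocations>
"""

RESOURCE_RE = re.compile(r"(\d+)\s*mb\s*,\s*(\d+)\s*vcores", re.IGNORECASE)


def parse_resources(text):
    """Turn 'X mb, Y vcores' into an (mb, vcores) tuple of integers."""
    match = RESOURCE_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError("unrecognized resource string: " + text)
    return int(match.group(1)), int(match.group(2))


def walk_queues(element, parent="root"):
    """Yield (full name, weight, minResources or None) for every queue."""
    for queue in element.findall("queue"):
        full = parent + "." + queue.get("name")
        # Weights default to 1 when no weight element is present.
        weight = float(queue.findtext("weight", default="1"))
        min_text = queue.findtext("minResources")
        yield full, weight, parse_resources(min_text) if min_text else None
        yield from walk_queues(queue, full)
```

Calling <<<list(walk_queues(ET.fromstring(SAMPLE)))>>> lists each queue with its fully qualified name, so the nested queue appears as "root.sample_queue.sample_sub_queue".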
Queue Access Control Lists (ACLs)
Queue Access Control Lists (ACLs) allow administrators to control who may
take actions on particular queues. They are configured with the aclSubmitApps
and aclAdministerApps properties, which can be set per queue. Currently the
only supported administrative action is killing an application. Anybody who
may administer a queue may also submit applications to it. These properties
take values in a format like "user1,user2 group1,group2" or " group1,group2".
An action on a queue will be permitted if its user or group is in the ACL of
that queue or in the ACL of any of that queue's ancestors. So if queue2
is inside queue1, and user1 is in queue1's ACL, and user2 is in queue2's
ACL, then both users may submit to queue2.
<<Note:>> The delimiter is a space character. To specify only ACL groups, begin the
value with a space character.
The root queue's ACLs are "*" by default which, because ACLs are passed down,
means that everybody may submit to and kill applications from every queue.
To start restricting access, change the root queue's ACLs to something other
than "*".
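The ACL format and the way permissions pass down from ancestor queues can be sketched as follows. This is a simplified model with hypothetical function names; it assumes that a non-root queue without a configured ACL grants nothing by itself, and that the root queue falls back to "*" as described above.

```python
from typing import Dict, Sequence, Set, Tuple


def parse_acl(value: str) -> Tuple[Set[str], Set[str]]:
    """Split 'user1,user2 group1,group2' into (users, groups).

    A leading space means the user list is empty, as in ' group1,group2'.
    """
    users_part, _, groups_part = value.partition(" ")
    users = {u for u in users_part.split(",") if u}
    groups = {g for g in groups_part.split(",") if g}
    return users, groups


def acl_allows(value: str, user: str, user_groups: Sequence[str]) -> bool:
    if value.strip() == "*":
        return True
    users, groups = parse_acl(value)
    return user in users or bool(groups & set(user_groups))


def may_act(queue: str, user: str, user_groups: Sequence[str],
            acls: Dict[str, str]) -> bool:
    """Check the ACL of the queue and of each of its ancestors.

    `queue` is a fully qualified name such as 'root.queue1.queue2'.
    """
    parts = queue.split(".")
    for depth in range(len(parts), 0, -1):
        name = ".".join(parts[:depth])
        # Assumption: only the root queue defaults to "*".
        default = "*" if name == "root" else " "
        if acl_allows(acls.get(name, default), user, user_groups):
            return True
    return False
```

With the root ACL restricted to " ", user1 in queue1's ACL and user2 in queue2's ACL, both users pass the check for "root.queue1.queue2", matching the example in the text.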
* {Administration}
The fair scheduler provides support for administration at runtime through a few mechanisms:
Modifying configuration at runtime
It is possible to modify minimum shares, limits, weights, preemption timeouts
and queue scheduling policies at runtime by editing the allocation file. The
scheduler will reload this file 10-15 seconds after it sees that it was
modified.
Monitoring through web UI
Current applications, queues, and fair shares can be examined through the
ResourceManager's web interface, at
http://<ResourceManager URL>/cluster/scheduler.
The following fields can be seen for each queue on the web interface:
* Used Resources - The sum of resources allocated to containers within the queue.
* Num Active Applications - The number of applications in the queue that have
received at least one container.
* Num Pending Applications - The number of applications in the queue that have
not yet received any containers.
* Min Resources - The configured minimum resources that are guaranteed to the queue.
* Max Resources - The configured maximum resources that are allowed to the queue.
* Instantaneous Fair Share - The queue's instantaneous fair share of resources.
These shares consider only active queues (those with running applications),
and are used for scheduling decisions. Queues may be allocated resources
beyond their shares when other queues aren't using them. A queue whose
resource consumption lies at or below its instantaneous fair share will never
have its containers preempted.
* Steady Fair Share - The queue's steady fair share of resources. These shares
consider all the queues irrespective of whether they are active (have
running applications) or not. These are computed less frequently and
change only when the configuration or capacity changes. They are meant to
provide visibility into resources the user can expect, and are hence displayed
in the Web UI.
Moving applications between queues
The Fair Scheduler supports moving a running application to a different queue.
This can be useful for moving an important application to a higher priority
queue, or for moving an unimportant application to a lower priority queue.
Apps can be moved by running "yarn application -movetoqueue appID -queue
targetQueueName".
When an application is moved to a queue, its existing allocations become
counted with the new queue's allocations instead of the old for purposes
of determining fairness. An attempt to move an application to a queue will
fail if the addition of the app's resources to that queue would violate
its maxRunningApps or maxResources constraints.
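The move constraint can be sketched as a simple check. The <<<QueueState>>> class and both functions are hypothetical, and the sketch tracks memory only, which is a simplification; the command line is the one quoted above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueueState:
    used_mb: int
    running_apps: int
    max_mb: int            # memory part of maxResources
    max_running_apps: int  # maxRunningApps


def can_move(app_mb: int, target: QueueState) -> bool:
    """True if adding the app keeps the target queue within its limits."""
    within_apps = target.running_apps + 1 <= target.max_running_apps
    within_resources = target.used_mb + app_mb <= target.max_mb
    return within_apps and within_resources


def move_command(app_id: str, target_queue: str) -> List[str]:
    """Argument list for the 'yarn application -movetoqueue' command."""
    return ["yarn", "application", "-movetoqueue", app_id, "-queue", target_queue]
```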

View File

@ -1,64 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
NodeManager Overview.
---
---
${maven.build.timestamp}
NodeManager Overview.
%{toc|section=1|fromDepth=0|toDepth=2}
* Overview
The NodeManager is responsible for launching and managing containers on a node. Containers execute tasks as specified by the AppMaster.
* Health checker service
The NodeManager runs services to determine the health of the node it is executing on. The services perform checks on the disk as well as any user-specified tests. If any health check fails, the NodeManager marks the node as unhealthy and communicates this to the ResourceManager, which then stops assigning containers to the node. Communication of the node status is done as part of the heartbeat between the NodeManager and the ResourceManager. The intervals at which the disk checker and health monitor (described below) run don't affect the heartbeat intervals. When the heartbeat takes place, the status of both checks is used to determine the health of the node.
** Disk checker
The disk checker checks the state of the disks that the NodeManager is configured to use (local-dirs and log-dirs, configured using yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs respectively). The checks include permissions and free disk space. It also checks that the filesystem isn't in a read-only state. The checks are run at 2 minute intervals by default but can be configured to run as often as the user desires. If a disk fails the check, the NodeManager stops using that particular disk but still reports the node status as healthy. However, if a number of disks fail the check (the number can be configured, as explained below), then the node is reported as unhealthy to the ResourceManager and new containers will not be assigned to the node. In addition, once a disk is marked as unhealthy, the NodeManager stops checking it to see if it has recovered (e.g. the disk became full and was then cleaned up). The only way for the NodeManager to use that disk again is to restart the software on the node. The following configuration parameters can be used to modify the disk checks:
*------------------+----------------+------------------+
|| Configuration name || Allowed Values || Description |
*------------------+----------------+------------------+
| yarn.nodemanager.disk-health-checker.enable | true, false | Enable or disable the disk health checker service |
*------------------+----------------+------------------+
| yarn.nodemanager.disk-health-checker.interval-ms | Positive integer | The interval, in milliseconds, at which the disk checker should run; the default value is 2 minutes |
*------------------+----------------+------------------+
| yarn.nodemanager.disk-health-checker.min-healthy-disks | Float between 0-1 | The minimum fraction of disks that must pass the check for the NodeManager to mark the node as healthy; the default is 0.25 |
*------------------+----------------+------------------+
| yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage | Float between 0-100 | The maximum percentage of disk space that may be utilized before a disk is marked as unhealthy by the disk checker service. This check is run for every disk used by the NodeManager. The default value is 100 i.e. the entire disk can be used. |
*------------------+----------------+------------------+
| yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb | Integer | The minimum amount of free space that must be available on the disk for the disk checker service to mark the disk as healthy. This check is run for every disk used by the NodeManager. The default value is 0 i.e. the entire disk can be used. |
*------------------+----------------+------------------+
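How the parameters in the table combine can be sketched as follows. This is an illustration rather than the NodeManager's code: the <<<Disk>>> class and function names are hypothetical, all disks are treated as one set, and the exact boundary comparisons are assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Disk:
    total_mb: int
    free_mb: int
    writable: bool = True  # False models a read-only filesystem


def disk_is_healthy(disk: Disk, max_utilization_pct: float = 100.0,
                    min_free_mb: int = 0) -> bool:
    """Apply the per-disk utilization and free-space limits."""
    used_pct = 100.0 * (disk.total_mb - disk.free_mb) / disk.total_mb
    return (disk.writable
            and used_pct <= max_utilization_pct
            and disk.free_mb >= min_free_mb)


def node_is_healthy(disks: Sequence[Disk], min_healthy_fraction: float = 0.25,
                    **limits) -> bool:
    """The node stays healthy while enough of its disks pass the check."""
    healthy = sum(disk_is_healthy(d, **limits) for d in disks)
    return healthy / len(disks) >= min_healthy_fraction
```

With the defaults from the table (100 percent utilization, 0 MB free space), a completely full but writable disk still passes, which matches the note that the entire disk can be used.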
** External health script
Users may specify their own health checker script that will be invoked by the health checker service. Users may specify a timeout as well as options to be passed to the script. If the script exits with a non-zero exit code, times out or results in an exception being thrown, the node is marked as unhealthy. Note that if the script cannot be executed due to permissions, an incorrect path, etc., then it counts as a failure and the node will be reported as unhealthy. Specifying a health check script is not mandatory. If no script is specified, only the disk checker status will be used to determine the health of the node. The following configuration parameters can be used to set the health script:
*------------------+----------------+------------------+
|| Configuration name || Allowed Values || Description |
*------------------+----------------+------------------+
| yarn.nodemanager.health-checker.interval-ms | Positive integer | The interval, in milliseconds, at which health checker service runs; the default value is 10 minutes. |
*------------------+----------------+------------------+
| yarn.nodemanager.health-checker.script.timeout-ms | Positive integer | The timeout for the health script that's executed; the default value is 20 minutes. |
*------------------+----------------+------------------+
| yarn.nodemanager.health-checker.script.path | String | Absolute path to the health check script to be run. |
*------------------+----------------+------------------+
| yarn.nodemanager.health-checker.script.opts | String | Arguments to be passed to the script when the script is executed. |
*------------------+----------------+------------------+
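The pass/fail rule for the external script can be sketched as follows. This is a stand-alone illustration, not the NodeManager's implementation; the function name is hypothetical and the script's output is ignored here.

```python
import subprocess
import sys
from typing import Sequence


def script_reports_healthy(command: Sequence[str], timeout_s: float) -> bool:
    """Run a health script and apply the rule described above.

    A non-zero exit code, a timeout, or a failure to start the script
    (for example wrong permissions or a wrong path) all count as unhealthy.
    """
    try:
        result = subprocess.run(command, timeout=timeout_s,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # A trivial stand-in "script" that exits with code 0.
    print(script_reports_healthy([sys.executable, "-c", "pass"], 30))
```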

View File

@ -1,77 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Using CGroups with YARN
---
---
${maven.build.timestamp}
Using CGroups with YARN
%{toc|section=1|fromDepth=0|toDepth=2}
CGroups is a mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behaviour. CGroups is a Linux kernel feature and was merged into kernel version 2.6.24. From a YARN perspective, this allows containers to be limited in their resource usage. A good example of this is CPU usage. Without CGroups, it becomes hard to limit container CPU usage. Currently, CGroups is only used for limiting CPU usage.
* CGroups configuration
The following settings are related to setting up CGroups. All of these need to be set in yarn-site.xml.
[[1]] yarn.nodemanager.container-executor.class
This should be set to "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor". CGroups is a Linux kernel feature and is exposed via the LinuxContainerExecutor.
[[2]] yarn.nodemanager.linux-container-executor.resources-handler.class
This should be set to "org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler". Using the LinuxContainerExecutor doesn't force you to use CGroups. If you wish to use CGroups, the resources handler class must be set to CgroupsLCEResourcesHandler.
[[3]] yarn.nodemanager.linux-container-executor.cgroups.hierarchy
The cgroups hierarchy under which to place YARN processes (cannot contain commas). If yarn.nodemanager.linux-container-executor.cgroups.mount is false (that is, if cgroups have been pre-configured), then this cgroups hierarchy must already exist.
[[4]] yarn.nodemanager.linux-container-executor.cgroups.mount
Whether the LCE should attempt to mount cgroups if not found. Can be true or false.
[[5]] yarn.nodemanager.linux-container-executor.cgroups.mount-path
Where the LCE should attempt to mount cgroups if not found. Common locations include /sys/fs/cgroup and /cgroup; the default location can vary depending on the Linux distribution in use. This path must exist before the NodeManager is launched. Only used when the LCE resources handler is set to the CgroupsLCEResourcesHandler, and yarn.nodemanager.linux-container-executor.cgroups.mount is true. A point to note here is that the container-executor binary will try to mount the path specified + "/" + the subsystem. In our case, since we are trying to limit CPU the binary tries to mount the path specified + "/cpu" and that's the path it expects to exist.
[[6]] yarn.nodemanager.linux-container-executor.group
The Unix group of the NodeManager. It should match the setting in "container-executor.cfg". This configuration is required for validating the secure access of the container-executor binary.
The following settings are related to limiting resource usage of YARN containers
[[1]] yarn.nodemanager.resource.percentage-physical-cpu-limit
This setting lets you limit the cpu usage of all YARN containers. It sets a hard upper limit on the cumulative CPU usage of the containers. For example, if set to 60, the combined CPU usage of all YARN containers will not exceed 60%.
[[2]] yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage
CGroups allows cpu usage limits to be hard or soft. When this setting is true, containers cannot use more CPU usage than allocated even if spare CPU is available. This ensures that containers can only use CPU that they were allocated. When set to false, containers can use spare CPU if available. It should be noted that irrespective of whether set to true or false, at no time can the combined CPU usage of all containers exceed the value specified in "yarn.nodemanager.resource.percentage-physical-cpu-limit".
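The interaction between the two settings can be illustrated with a small calculation. This is a simplified model under stated assumptions: the function name is hypothetical, and the idea that a strict limit is the container's proportional share of the node's vcores is an assumption, not something this document specifies.

```python
def container_cpu_ceiling(container_vcores: int, node_vcores: int,
                          physical_cores: int, cpu_limit_pct: float = 100.0,
                          strict: bool = False) -> float:
    """Upper bound on CPU a container may use, in physical cores.

    cpu_limit_pct models percentage-physical-cpu-limit and caps all
    containers together; strict models strict-resource-usage.
    """
    yarn_cores = physical_cores * cpu_limit_pct / 100.0
    if not strict:
        # Soft limit: spare CPU may be used, up to the overall cap.
        return yarn_cores
    # Assumption: a strict limit is the proportional share of the cap.
    return yarn_cores * container_vcores / node_vcores
```

For a 16-core node limited to 60 percent, the combined ceiling is 9.6 cores; under the proportional assumption, a container holding 2 of 8 vcores would be held to about 2.4 cores in strict mode.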
* CGroups and security
CGroups itself has no requirements related to security. However, the LinuxContainerExecutor does have some requirements. If running in non-secure mode, by default, the LCE runs all jobs as user "nobody". This user can be changed by setting "yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user" to the desired user. However, it can also be configured to run jobs as the user submitting the job. In that case "yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users" should be set to false.
*-----------+-----------+---------------------------+
|| yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user || yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users || User running jobs |
*-----------+-----------+---------------------------+
| (default) | (default) | nobody |
*-----------+-----------+---------------------------+
| yarn | (default) | yarn |
*-----------+-----------+---------------------------+
| yarn | false | (User submitting the job) |
*-----------+-----------+---------------------------+
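The table can be read as a single decision, sketched below for non-secure mode. The function name is hypothetical; the defaults ("nobody" and limit-users enabled) are taken from the table's default rows.

```python
def job_user(submitter: str, local_user: str = "nobody",
             limit_users: bool = True) -> str:
    """User that runs a job in non-secure mode, following the table above.

    local_user models ...nonsecure-mode.local-user and limit_users models
    ...nonsecure-mode.limit-users.
    """
    return local_user if limit_users else submitter
```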

View File

@ -1,645 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
NodeManager REST API's.
---
---
${maven.build.timestamp}
NodeManager REST API's.
%{toc|section=1|fromDepth=0|toDepth=2}
* Overview
The NodeManager REST APIs allow the user to get status on the node and information about applications and containers running on that node.
* NodeManager Information API
The node information resource provides overall information about that particular node.
** URI
Both of the following URIs give you the node information.
------
* http://<nm http address:port>/ws/v1/node
* http://<nm http address:port>/ws/v1/node/info
------
** HTTP Operations Supported
------
* GET
------
** Query Parameters Supported
------
None
------
** Elements of the <nodeInfo> object
*---------------+--------------+-------------------------------+
|| Item || Data Type || Description |
*---------------+--------------+-------------------------------+
| id | string | The NodeManager id |
*---------------+--------------+-------------------------------+
| nodeHostName | string | The host name of the NodeManager |
*---------------+--------------+-------------------------------+
| totalPmemAllocatedContainersMB | long | The amount of physical memory allocated for use by containers in MB |
*---------------+--------------+-------------------------------+
| totalVmemAllocatedContainersMB | long | The amount of virtual memory allocated for use by containers in MB |
*---------------+--------------+-------------------------------+
| totalVCoresAllocatedContainers | long | The number of virtual cores allocated for use by containers |
*---------------+--------------+-------------------------------+
| lastNodeUpdateTime | long | The last timestamp at which the health report was received (in ms since epoch)|
*---------------+--------------+-------------------------------+
| healthReport | string | The diagnostic health report of the node |
*---------------+--------------+-------------------------------+
| nodeHealthy | boolean | true/false indicator of whether the node is healthy |
*---------------+--------------+-------------------------------+
| nodeManagerVersion | string | Version of the NodeManager |
*---------------+--------------+-------------------------------+
| nodeManagerBuildVersion | string | NodeManager build string with build version, user, and checksum |
*---------------+--------------+-------------------------------+
| nodeManagerVersionBuiltOn | string | Timestamp when NodeManager was built (as a date string) |
*---------------+--------------+-------------------------------+
| hadoopVersion | string | Version of hadoop common |
*---------------+--------------+-------------------------------+
| hadoopBuildVersion | string | Hadoop common build string with build version, user, and checksum |
*---------------+--------------+-------------------------------+
| hadoopVersionBuiltOn | string | Timestamp when hadoop common was built (in ms since epoch) |
*---------------+--------------+-------------------------------+
** Response Examples
<<JSON response>>
HTTP Request:
------
GET http://<nm http address:port>/ws/v1/node/info
------
Response Header:
+---+
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26)
+---+
Response Body:
+---+
{
"nodeInfo" : {
"hadoopVersionBuiltOn" : "Mon Jan 9 14:58:42 UTC 2012",
"nodeManagerBuildVersion" : "0.23.1-SNAPSHOT from 1228355 by user1 source checksum 20647f76c36430e888cc7204826a445c",
"lastNodeUpdateTime" : 1326222266126,
"totalVmemAllocatedContainersMB" : 17203,
"totalVCoresAllocatedContainers" : 8,
"nodeHealthy" : true,
"healthReport" : "",
"totalPmemAllocatedContainersMB" : 8192,
"nodeManagerVersionBuiltOn" : "Mon Jan 9 15:01:59 UTC 2012",
"nodeManagerVersion" : "0.23.1-SNAPSHOT",
"id" : "host.domain.com:8041",
"hadoopBuildVersion" : "0.23.1-SNAPSHOT from 1228292 by user1 source checksum 3eba233f2248a089e9b28841a784dd00",
"nodeHostName" : "host.domain.com",
"hadoopVersion" : "0.23.1-SNAPSHOT"
}
}
+---+
<<XML response>>
HTTP Request:
-----
GET http://<nm http address:port>/ws/v1/node/info
Accept: application/xml
-----
Response Header:
+---+
HTTP/1.1 200 OK
Content-Type: application/xml
Content-Length: 983
Server: Jetty(6.1.26)
+---+
Response Body:
+---+
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<nodeInfo>
<healthReport/>
<totalVmemAllocatedContainersMB>17203</totalVmemAllocatedContainersMB>
<totalPmemAllocatedContainersMB>8192</totalPmemAllocatedContainersMB>
<totalVCoresAllocatedContainers>8</totalVCoresAllocatedContainers>
<lastNodeUpdateTime>1326222386134</lastNodeUpdateTime>
<nodeHealthy>true</nodeHealthy>
<nodeManagerVersion>0.23.1-SNAPSHOT</nodeManagerVersion>
<nodeManagerBuildVersion>0.23.1-SNAPSHOT from 1228355 by user1 source checksum 20647f76c36430e888cc7204826a445c</nodeManagerBuildVersion>
<nodeManagerVersionBuiltOn>Mon Jan 9 15:01:59 UTC 2012</nodeManagerVersionBuiltOn>
<hadoopVersion>0.23.1-SNAPSHOT</hadoopVersion>
<hadoopBuildVersion>0.23.1-SNAPSHOT from 1228292 by user1 source checksum 3eba233f2248a089e9b28841a784dd00</hadoopBuildVersion>
<hadoopVersionBuiltOn>Mon Jan 9 14:58:42 UTC 2012</hadoopVersionBuiltOn>
<id>host.domain.com:8041</id>
<nodeHostName>host.domain.com</nodeHostName>
</nodeInfo>
+---+
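As a minimal sketch of consuming this response, the Python snippet below parses a trimmed copy of the JSON body shown above and pulls out a few health and capacity fields. The function name summarize_node_info is hypothetical, and the body is embedded as a string rather than fetched, so nothing here assumes a live NodeManager.

```python
import json

# Trimmed copy of the JSON response body from the example above.
SAMPLE_BODY = """
{
  "nodeInfo": {
    "id": "host.domain.com:8041",
    "nodeHostName": "host.domain.com",
    "nodeHealthy": true,
    "healthReport": "",
    "totalPmemAllocatedContainersMB": 8192,
    "totalVmemAllocatedContainersMB": 17203,
    "totalVCoresAllocatedContainers": 8,
    "lastNodeUpdateTime": 1326222266126
  }
}
"""


def summarize_node_info(body):
    """Return a small dict of health and capacity fields from a nodeInfo body."""
    info = json.loads(body)["nodeInfo"]
    return {
        "host": info["nodeHostName"],
        "healthy": info["nodeHealthy"],
        "pmem_mb": info["totalPmemAllocatedContainersMB"],
        "vmem_mb": info["totalVmemAllocatedContainersMB"],
        "vcores": info["totalVCoresAllocatedContainers"],
    }


summary = summarize_node_info(SAMPLE_BODY)
print(summary)
```

In a real client the body would come from a GET on the /ws/v1/node/info URI; the parsing step stays the same.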
* Applications API
With the Applications API, you can obtain a collection of resources, each of which represents an application. When you run a GET operation on this resource, you obtain a collection of Application Objects. See also {{Application API}} for syntax of the application object.
** URI
------
* http://<nm http address:port>/ws/v1/node/apps
------
** HTTP Operations Supported
------
* GET
------
** Query Parameters Supported
Multiple parameters can be specified.
------
* state - application state
* user - user name
------
** Elements of the <apps> (Applications) object
When you make a request for the list of applications, the information will be returned as a collection of app objects.
See also {{Application API}} for syntax of the app object.
*---------------+--------------+-------------------------------+
|| Item || Data Type || Description |
*---------------+--------------+-------------------------------+
| app | array of app objects(JSON)/zero or more app objects(XML) | A collection of application objects |
*---------------+--------------+--------------------------------+
** Response Examples
<<JSON response>>
HTTP Request:
------
GET http://<nm http address:port>/ws/v1/node/apps
------
Response Header:
+---+
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26)
+---+
Response Body:
+---+
{
"apps" : {
"app" : [
{
"containerids" : [
"container_1326121700862_0003_01_000001",
"container_1326121700862_0003_01_000002"
],
"user" : "user1",
"id" : "application_1326121700862_0003",
"state" : "RUNNING"
},
{
"user" : "user1",
"id" : "application_1326121700862_0002",
"state" : "FINISHED"
}
]
}
}
+---+
<<XML response>>
HTTP Request:
------
GET http://<nm http address:port>/ws/v1/node/apps
Accept: application/xml
------
Response Header:
+---+
HTTP/1.1 200 OK
Content-Type: application/xml
Content-Length: 400
Server: Jetty(6.1.26)
+---+
Response Body:
+---+
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<apps>
<app>
<id>application_1326121700862_0002</id>
<state>FINISHED</state>
<user>user1</user>
</app>
<app>
<id>application_1326121700862_0003</id>
<state>RUNNING</state>
<user>user1</user>
<containerids>container_1326121700862_0003_01_000002</containerids>
<containerids>container_1326121700862_0003_01_000001</containerids>
</app>
</apps>
+---+
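The two query parameters can be combined on one request. The sketch below shows one way to build such a URL in Python; build_apps_url is a hypothetical helper, and the address host.domain.com:8042 is only an illustrative value taken from the container log links elsewhere in this document.

```python
from urllib.parse import urlencode


def build_apps_url(nm_address, state=None, user=None):
    """Build the Applications API URL, adding only the filters that are set."""
    base = "http://{}/ws/v1/node/apps".format(nm_address)
    params = {}
    if state is not None:
        params["state"] = state
    if user is not None:
        params["user"] = user
    # urlencode escapes the values; no query string is added when no filter is set.
    return base + ("?" + urlencode(params) if params else "")


print(build_apps_url("host.domain.com:8042"))
print(build_apps_url("host.domain.com:8042", state="RUNNING", user="user1"))
```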
* {Application API}
An application resource contains information about a particular application that was run or is running on this NodeManager.
** URI
Use the following URI to obtain an app Object, for an application identified by the {appid} value.
------
* http://<nm http address:port>/ws/v1/node/apps/{appid}
------
** HTTP Operations Supported
------
* GET
------
** Query Parameters Supported
------
None
------
** Elements of the <app> (Application) object
*---------------+--------------+-------------------------------+
|| Item || Data Type || Description |
*---------------+--------------+-------------------------------+
| id | string | The application id |
*---------------+--------------+--------------------------------+
| user | string | The user who started the application |
*---------------+--------------+--------------------------------+
| state | string | The state of the application - valid states are: NEW, INITING, RUNNING, FINISHING_CONTAINERS_WAIT, APPLICATION_RESOURCES_CLEANINGUP, FINISHED |
*---------------+--------------+--------------------------------+
| containerids | array of containerids(JSON)/zero or more containerids(XML) | The list of containerids currently being used by the application on this node. If not present then no containers are currently running for this application.|
*---------------+--------------+--------------------------------+
** Response Examples
<<JSON response>>
HTTP Request:
------
GET http://<nm http address:port>/ws/v1/node/apps/application_1326121700862_0005
------
Response Header:
+---+
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26)
+---+
Response Body:
+---+
{
"app" : {
"containerids" : [
"container_1326121700862_0005_01_000003",
"container_1326121700862_0005_01_000001"
],
"user" : "user1",
"id" : "application_1326121700862_0005",
"state" : "RUNNING"
}
}
+---+
<<XML response>>
HTTP Request:
------
GET http://<nm http address:port>/ws/v1/node/apps/application_1326121700862_0005
Accept: application/xml
------
Response Header:
+---+
HTTP/1.1 200 OK
Content-Type: application/xml
Content-Length: 281
Server: Jetty(6.1.26)
+---+
Response Body:
+---+
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<app>
<id>application_1326121700862_0005</id>
<state>RUNNING</state>
<user>user1</user>
<containerids>container_1326121700862_0005_01_000003</containerids>
<containerids>container_1326121700862_0005_01_000001</containerids>
</app>
+---+
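In the XML form, containerids appears as zero or more repeated elements rather than as an array. The following sketch, using Python's standard xml.etree.ElementTree module, collects them into a list; parse_app is a hypothetical helper and the sample is the XML body shown above.

```python
import xml.etree.ElementTree as ET

# XML response body from the example above.
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<app>
  <id>application_1326121700862_0005</id>
  <state>RUNNING</state>
  <user>user1</user>
  <containerids>container_1326121700862_0005_01_000003</containerids>
  <containerids>container_1326121700862_0005_01_000001</containerids>
</app>"""


def parse_app(xml_text):
    """Turn an <app> document into a dict; containerids becomes a (possibly empty) list."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return {
        "id": root.findtext("id"),
        "state": root.findtext("state"),
        "user": root.findtext("user"),
        "containerids": [e.text for e in root.findall("containerids")],
    }


app = parse_app(SAMPLE_XML)
print(app["id"], app["state"], len(app["containerids"]))
```

An application with no running containers on the node simply yields an empty list, which matches the "if not present" note in the table above.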
* Containers API
With the containers API, you can obtain a collection of resources, each of which represents a container. When you run a GET operation on this resource, you obtain a collection of Container Objects. See also {{Container API}} for syntax of the container object.
** URI
------
* http://<nm http address:port>/ws/v1/node/containers
------
** HTTP Operations Supported
------
* GET
------
** Query Parameters Supported
------
None
------
** Elements of the <containers> object
When you make a request for the list of containers, the information will be returned as a collection of container objects.
See also {{Container API}} for syntax of the container object.
*---------------+--------------+-------------------------------+
|| Item || Data Type || Description |
*---------------+--------------+-------------------------------+
| container | array of container objects(JSON)/zero or more container objects(XML) | A collection of container objects |
*---------------+--------------+-------------------------------+
** Response Examples
<<JSON response>>
HTTP Request:
------
GET http://<nm http address:port>/ws/v1/node/containers
------
Response Header:
+---+
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26)
+---+
Response Body:
+---+
{
"containers" : {
"container" : [
{
"nodeId" : "host.domain.com:8041",
"totalMemoryNeededMB" : 2048,
"totalVCoresNeeded" : 1,
"state" : "RUNNING",
"diagnostics" : "",
"containerLogsLink" : "http://host.domain.com:8042/node/containerlogs/container_1326121700862_0006_01_000001/user1",
"user" : "user1",
"id" : "container_1326121700862_0006_01_000001",
"exitCode" : -1000
},
{
"nodeId" : "host.domain.com:8041",
"totalMemoryNeededMB" : 2048,
"totalVCoresNeeded" : 2,
"state" : "RUNNING",
"diagnostics" : "",
"containerLogsLink" : "http://host.domain.com:8042/node/containerlogs/container_1326121700862_0006_01_000003/user1",
"user" : "user1",
"id" : "container_1326121700862_0006_01_000003",
"exitCode" : -1000
}
]
}
}
+---+
<<XML response>>
HTTP Request:
------
GET http://<nm http address:port>/ws/v1/node/containers
Accept: application/xml
------
Response Header:
+---+
HTTP/1.1 200 OK
Content-Type: application/xml
Content-Length: 988
Server: Jetty(6.1.26)
+---+
Response Body:
+---+
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<containers>
<container>
<id>container_1326121700862_0006_01_000001</id>
<state>RUNNING</state>
<exitCode>-1000</exitCode>
<diagnostics/>
<user>user1</user>
<totalMemoryNeededMB>2048</totalMemoryNeededMB>
<totalVCoresNeeded>1</totalVCoresNeeded>
<containerLogsLink>http://host.domain.com:8042/node/containerlogs/container_1326121700862_0006_01_000001/user1</containerLogsLink>
<nodeId>host.domain.com:8041</nodeId>
</container>
<container>
<id>container_1326121700862_0006_01_000003</id>
<state>DONE</state>
<exitCode>0</exitCode>
<diagnostics>Container killed by the ApplicationMaster.</diagnostics>
<user>user1</user>
<totalMemoryNeededMB>2048</totalMemoryNeededMB>
<totalVCoresNeeded>2</totalVCoresNeeded>
<containerLogsLink>http://host.domain.com:8042/node/containerlogs/container_1326121700862_0006_01_000003/user1</containerLogsLink>
<nodeId>host.domain.com:8041</nodeId>
</container>
</containers>
+---+
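Note that in the JSON form the list sits under containers.container. As a sketch of working with that shape, the Python snippet below totals the memory and virtual cores of the RUNNING containers in a trimmed copy of the JSON body above; running_totals is a hypothetical helper name.

```python
import json

# Trimmed copy of the JSON response body from the example above.
SAMPLE_BODY = """
{
  "containers": {
    "container": [
      {"id": "container_1326121700862_0006_01_000001", "state": "RUNNING",
       "totalMemoryNeededMB": 2048, "totalVCoresNeeded": 1, "exitCode": -1000},
      {"id": "container_1326121700862_0006_01_000003", "state": "RUNNING",
       "totalMemoryNeededMB": 2048, "totalVCoresNeeded": 2, "exitCode": -1000}
    ]
  }
}
"""


def running_totals(body):
    """Sum memory and vcores over the containers whose state is RUNNING."""
    containers = json.loads(body)["containers"]["container"]
    running = [c for c in containers if c["state"] == "RUNNING"]
    return {
        "count": len(running),
        "memory_mb": sum(c["totalMemoryNeededMB"] for c in running),
        "vcores": sum(c["totalVCoresNeeded"] for c in running),
    }


totals = running_totals(SAMPLE_BODY)
print(totals)
```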
* {Container API}
A container resource contains information about a particular container that is running on this NodeManager.
** URI
Use the following URI to obtain a Container Object, from a container identified by the {containerid} value.
------
* http://<nm http address:port>/ws/v1/node/containers/{containerid}
------
** HTTP Operations Supported
------
* GET
------
** Query Parameters Supported
------
None
------
** Elements of the <container> object
*---------------+--------------+-------------------------------+
|| Item || Data Type || Description |
*---------------+--------------+-------------------------------+
| id | string | The container id |
*---------------+--------------+-------------------------------+
| state | string | State of the container - valid states are: NEW, LOCALIZING, LOCALIZATION_FAILED, LOCALIZED, RUNNING, EXITED_WITH_SUCCESS, EXITED_WITH_FAILURE, KILLING, CONTAINER_CLEANEDUP_AFTER_KILL, CONTAINER_RESOURCES_CLEANINGUP, DONE|
*---------------+--------------+-------------------------------+
| nodeId | string | The id of the node the container is on|
*---------------+--------------+-------------------------------+
| containerLogsLink | string | The http link to the container logs |
*---------------+--------------+-------------------------------+
| user | string | The user name of the user who started the container |
*---------------+--------------+-------------------------------+
| exitCode | int | Exit code of the container |
*---------------+--------------+-------------------------------+
| diagnostics | string | A diagnostic message for failed containers |
*---------------+--------------+-------------------------------+
| totalMemoryNeededMB | long | Total amount of memory needed by the container (in MB) |
*---------------+--------------+-------------------------------+
| totalVCoresNeeded | long | Total number of virtual cores needed by the container |
*---------------+--------------+-------------------------------+
** Response Examples
<<JSON response>>
HTTP Request:
------
GET http://<nm http address:port>/ws/v1/node/containers/container_1326121700862_0007_01_000001
------
Response Header:
+---+
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26)
+---+
Response Body:
+---+
{
"container" : {
"nodeId" : "host.domain.com:8041",
"totalMemoryNeededMB" : 2048,
"totalVCoresNeeded" : 1,
"state" : "RUNNING",
"diagnostics" : "",
"containerLogsLink" : "http://host.domain.com:8042/node/containerlogs/container_1326121700862_0007_01_000001/user1",
"user" : "user1",
"id" : "container_1326121700862_0007_01_000001",
"exitCode" : -1000
}
}
+---+
<<XML response>>
HTTP Request:
------
GET http://<nm http address:port>/ws/v1/node/containers/container_1326121700862_0007_01_000001
Accept: application/xml
------
Response Header:
+---+
HTTP/1.1 200 OK
Content-Type: application/xml
Content-Length: 491
Server: Jetty(6.1.26)
+---+
Response Body:
+---+
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<container>
<id>container_1326121700862_0007_01_000001</id>
<state>RUNNING</state>
<exitCode>-1000</exitCode>
<diagnostics/>
<user>user1</user>
<totalMemoryNeededMB>2048</totalMemoryNeededMB>
<totalVCoresNeeded>1</totalVCoresNeeded>
<containerLogsLink>http://host.domain.com:8042/node/containerlogs/container_1326121700862_0007_01_000001/user1</containerLogsLink>
<nodeId>host.domain.com:8041</nodeId>
</container>
+---+
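A client may want to build the per-container URI and reject states it does not recognize. The sketch below does both in Python; container_url and parse_container are hypothetical helpers, the state list is copied from the element table above, and the address host.domain.com:8042 is illustrative.

```python
import json

# Valid container states, copied from the <container> element table above.
VALID_STATES = frozenset([
    "NEW", "LOCALIZING", "LOCALIZATION_FAILED", "LOCALIZED", "RUNNING",
    "EXITED_WITH_SUCCESS", "EXITED_WITH_FAILURE", "KILLING",
    "CONTAINER_CLEANEDUP_AFTER_KILL", "CONTAINER_RESOURCES_CLEANINGUP", "DONE",
])

# Trimmed copy of the JSON response body from the example above.
SAMPLE_BODY = """
{
  "container": {
    "id": "container_1326121700862_0007_01_000001",
    "state": "RUNNING",
    "user": "user1",
    "exitCode": -1000,
    "totalMemoryNeededMB": 2048,
    "totalVCoresNeeded": 1
  }
}
"""


def container_url(nm_address, container_id):
    """Build the Container API URI (note the path segment is 'node', not 'nodes')."""
    return "http://{}/ws/v1/node/containers/{}".format(nm_address, container_id)


def parse_container(body):
    """Parse a single-container JSON body, rejecting undocumented states."""
    container = json.loads(body)["container"]
    if container["state"] not in VALID_STATES:
        raise ValueError("unexpected container state: " + container["state"])
    return container


container = parse_container(SAMPLE_BODY)
print(container_url("host.domain.com:8042", container["id"]))
```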


@ -1,86 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
NodeManager Restart
---
---
${maven.build.timestamp}
NodeManager Restart
* Introduction
This document gives an overview of NodeManager (NM) restart, a feature that
enables the NodeManager to be restarted without losing
the active containers running on the node. At a high level, the NM stores any
necessary state to a local state-store as it processes container-management
requests. When the NM restarts, it recovers by first loading state for
various subsystems and then letting those subsystems perform recovery using
the loaded state.
* Enabling NM Restart
[[1]] To enable NM Restart functionality, set the following property in <<conf/yarn-site.xml>> to true:
*--------------------------------------+--------------------------------------+
|| Property || Value |
*--------------------------------------+--------------------------------------+
| <<<yarn.nodemanager.recovery.enabled>>> | |
| | <<<true>>> (the default value is <<<false>>>) |
*--------------------------------------+--------------------------------------+
[[2]] Configure a path to the local file-system directory where the
NodeManager can save its run state:
*--------------------------------------+--------------------------------------+
|| Property || Description |
*--------------------------------------+--------------------------------------+
| <<<yarn.nodemanager.recovery.dir>>> | |
| | The local filesystem directory in which the node manager will store state |
| | when recovery is enabled. |
| | The default value is set to |
| | <<<${hadoop.tmp.dir}/yarn-nm-recovery>>>. |
*--------------------------------------+--------------------------------------+
[[3]] Configure a valid RPC address for the NodeManager:
*--------------------------------------+--------------------------------------+
|| Property || Description |
*--------------------------------------+--------------------------------------+
| <<<yarn.nodemanager.address>>> | |
| | Ephemeral ports (port 0, which is the default) cannot be used for the |
| | NodeManager's RPC server specified via yarn.nodemanager.address, because the |
| | NM could then bind to different ports before and after a restart. This would break any |
| | previously running clients that were communicating with the NM before the |
| | restart. Explicitly setting yarn.nodemanager.address to an address with a |
| | specific port number (e.g., 0.0.0.0:45454) is a precondition for enabling |
| | NM restart. |
*--------------------------------------+--------------------------------------+
[[4]] Auxiliary services
NodeManagers in a YARN cluster can be configured to run auxiliary services.
For a completely functional NM restart, YARN relies on any auxiliary service
configured to also support recovery. This usually includes (1) avoiding usage
of ephemeral ports so that previously running clients (in this case, usually
containers) are not disrupted after restart and (2) having the auxiliary
service itself support recoverability by reloading any previous state when
NodeManager restarts and reinitializes the auxiliary service.
A simple example for the above is the auxiliary service 'ShuffleHandler' for
MapReduce (MR). ShuffleHandler respects the above two requirements already,
so users/admins don't have to do anything for it to support NM restart: (1) The
configuration property <<mapreduce.shuffle.port>> controls which port the
ShuffleHandler on a NodeManager host binds to, and it defaults to a
non-ephemeral port. (2) The ShuffleHandler service also already supports
recovery of previous state after NM restarts.
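The preconditions in steps 1 and 3 above lend themselves to a simple pre-flight check. The following Python sketch is an illustration only: check_nm_restart_config is a hypothetical helper, and it assumes the effective yarn-site.xml properties have already been loaded into a plain dict of strings.

```python
def check_nm_restart_config(props):
    """Return a list of problems that would prevent NM restart from working."""
    problems = []
    # Step 1: recovery must be switched on explicitly (it is off by default).
    if props.get("yarn.nodemanager.recovery.enabled", "false").lower() != "true":
        problems.append("yarn.nodemanager.recovery.enabled must be true")
    # Step 3: the RPC address needs a fixed port; port 0 means "ephemeral".
    address = props.get("yarn.nodemanager.address", "")
    port = address.rpartition(":")[2]
    if not port.isdigit() or int(port) == 0:
        problems.append("yarn.nodemanager.address must use a fixed, non-zero port")
    return problems


good = {
    "yarn.nodemanager.recovery.enabled": "true",
    "yarn.nodemanager.address": "0.0.0.0:45454",
}
print(check_nm_restart_config(good))
print(check_nm_restart_config({"yarn.nodemanager.address": "0.0.0.0:0"}))
```

The check deliberately treats a missing yarn.nodemanager.address as a problem, since the text above states that the default uses an ephemeral port.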


@ -1,233 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
ResourceManager High Availability
---
---
${maven.build.timestamp}
ResourceManager High Availability
%{toc|section=1|fromDepth=0}
* Introduction
This guide provides an overview of High Availability of YARN's ResourceManager,
and details how to configure and use this feature. The ResourceManager (RM)
is responsible for tracking the resources in a cluster, and scheduling
applications (e.g., MapReduce jobs). Prior to Hadoop 2.4, the ResourceManager
was the single point of failure in a YARN cluster. The High Availability
feature adds redundancy in the form of an Active/Standby ResourceManager pair
to remove this otherwise single point of failure.
* Architecture
[images/rm-ha-overview.png] Overview of ResourceManager High Availability
** RM Failover
ResourceManager HA is realized through an Active/Standby architecture - at
any point of time, one of the RMs is Active, and one or more RMs are in
Standby mode waiting to take over should anything happen to the Active.
The trigger to transition-to-active comes from either the admin (through CLI)
or through the integrated failover-controller when automatic-failover is
enabled.
*** Manual transitions and failover
When automatic failover is not enabled, admins have to manually transition
one of the RMs to Active. To failover from one RM to the other, they are
expected to first transition the Active-RM to Standby and transition a
Standby-RM to Active. All this can be done using the "<<<yarn rmadmin>>>"
CLI.
*** Automatic failover
The RMs have an option to embed the ZooKeeper-based ActiveStandbyElector to
decide which RM should be the Active. When the Active goes down or becomes
unresponsive, another RM is automatically elected to be the Active, which
then takes over. Note that there is no need to run a separate ZKFC daemon,
as is the case for HDFS, because the ActiveStandbyElector embedded in the RMs
acts as both a failure detector and a leader elector.
*** Client, ApplicationMaster and NodeManager on RM failover
When there are multiple RMs, the configuration (yarn-site.xml) used by
clients and nodes is expected to list all the RMs. Clients,
ApplicationMasters (AMs) and NodeManagers (NMs) try connecting to the RMs in
a round-robin fashion until they hit the Active RM. If the Active goes down,
they resume the round-robin polling until they hit the "new" Active.
This default retry logic is implemented as
<<<org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider>>>.
You can override the logic by
implementing <<<org.apache.hadoop.yarn.client.RMFailoverProxyProvider>>> and
setting the value of <<<yarn.client.failover-proxy-provider>>> to
the class name.
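The round-robin behaviour described above can be sketched in a few lines of Python. This is a simplified model, not the code of ConfiguredRMFailoverProxyProvider: find_active_rm and the is_active probe are hypothetical names, and a real client would attempt an RPC instead of looking up a dict.

```python
def find_active_rm(rm_ids, is_active, start=0, max_attempts=None):
    """Try RMs in round-robin order from index `start`; return the first Active id or None."""
    if max_attempts is None:
        max_attempts = len(rm_ids)
    for attempt in range(max_attempts):
        rm_id = rm_ids[(start + attempt) % len(rm_ids)]
        if is_active(rm_id):
            return rm_id
    return None


rm_ids = ["rm1", "rm2"]
states = {"rm1": "standby", "rm2": "active"}
first = find_active_rm(rm_ids, lambda rm: states[rm] == "active")

# Simulate a failover: rm2 goes down and rm1 becomes the new Active.
states = {"rm1": "active", "rm2": "standby"}
second = find_active_rm(rm_ids, lambda rm: states[rm] == "active", start=1)
print(first, second)
```

Resuming from the index of the previously Active RM (start=1 here) mirrors the "resume the round-robin polling" step in the text.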
** Recovering previous active-RM's state
With {{{./ResourceManagerRestart.html}ResourceManager Restart}} enabled,
the RM being promoted to an active state loads the RM internal state and
continues to operate from where the previous active left off as much as
possible depending on the RM restart feature. A new attempt is spawned for
each managed application previously submitted to the RM. Applications can
checkpoint periodically to avoid losing any work. The state-store must be
visible from both the Active and Standby RMs. Currently, there are two
RMStateStore implementations for persistence - FileSystemRMStateStore
and ZKRMStateStore. The <<<ZKRMStateStore>>> implicitly allows write access
to a single RM at any point in time, and hence is the recommended store to
use in an HA cluster. When using the ZKRMStateStore, there is no need for a
separate fencing mechanism to address a potential split-brain situation
where multiple RMs can potentially assume the Active role.
* Deployment
** Configurations
Most of the failover functionality is tunable using various configuration
properties. Following is a list of required/important ones. yarn-default.xml
carries a full-list of knobs. See
{{{../hadoop-yarn-common/yarn-default.xml}yarn-default.xml}}
for more information including default values.
See also {{{./ResourceManagerRestart.html}the document for ResourceManager
Restart}} for instructions on setting up the state-store.
*-------------------------+----------------------------------------------+
|| Configuration Property || Description |
*-------------------------+----------------------------------------------+
| yarn.resourcemanager.zk-address | |
| | Address of the ZK-quorum. |
| | Used both for the state-store and embedded leader-election. |
*-------------------------+----------------------------------------------+
| yarn.resourcemanager.ha.enabled | |
| | Enable RM HA. |
*-------------------------+----------------------------------------------+
| yarn.resourcemanager.ha.rm-ids | |
| | List of logical IDs for the RMs. |
| | e.g., "rm1,rm2" |
*-------------------------+----------------------------------------------+
| yarn.resourcemanager.hostname.<rm-id> | |
| | For each <rm-id>, specify the hostname the |
| | RM corresponds to. Alternately, one could set each of the RM's service |
| | addresses. |
*-------------------------+----------------------------------------------+
| yarn.resourcemanager.ha.id | |
| | Identifies the RM in the ensemble. This is optional; |
| | however, if set, admins have to ensure that all the RMs have their own |
| | IDs in the config |
*-------------------------+----------------------------------------------+
| yarn.resourcemanager.ha.automatic-failover.enabled | |
| | Enable automatic failover; |
| | By default, it is enabled only when HA is enabled. |
*-------------------------+----------------------------------------------+
| yarn.resourcemanager.ha.automatic-failover.embedded | |
| | Use embedded leader-elector |
| | to pick the Active RM, when automatic failover is enabled. By default, |
| | it is enabled only when HA is enabled. |
*-------------------------+----------------------------------------------+
| yarn.resourcemanager.cluster-id | |
| | Identifies the cluster. Used by the elector to |
| | ensure an RM doesn't take over as Active for another cluster. |
*-------------------------+----------------------------------------------+
| yarn.client.failover-proxy-provider | |
| | The class to be used by Clients, AMs and NMs to failover to the Active RM. |
*-------------------------+----------------------------------------------+
| yarn.client.failover-max-attempts | |
| | The max number of times FailoverProxyProvider should attempt failover. |
*-------------------------+----------------------------------------------+
| yarn.client.failover-sleep-base-ms | |
| | The sleep base (in milliseconds) to be used for calculating |
| | the exponential delay between failovers. |
*-------------------------+----------------------------------------------+
| yarn.client.failover-sleep-max-ms | |
| | The maximum sleep time (in milliseconds) between failovers |
*-------------------------+----------------------------------------------+
| yarn.client.failover-retries | |
| | The number of retries per attempt to connect to a ResourceManager. |
*-------------------------+----------------------------------------------+
| yarn.client.failover-retries-on-socket-timeouts | |
| | The number of retries per attempt to connect to a ResourceManager on socket timeouts. |
*-------------------------+----------------------------------------------+
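To illustrate how the sleep-base and sleep-max settings interact, the Python sketch below computes a capped exponential delay. It is a simplified model under stated assumptions: the doubling formula is an assumption about what "exponential delay" means here, Hadoop's actual retry policy may add randomization, and the values 500 and 15000 are illustrative rather than the shipped defaults.

```python
def failover_delay_ms(attempt, sleep_base_ms, sleep_max_ms):
    """Capped exponential delay before the given failover attempt (0-based)."""
    return min(sleep_base_ms * (2 ** attempt), sleep_max_ms)


# Illustrative values only; see yarn-default.xml for the real defaults.
delays = [failover_delay_ms(n, 500, 15000) for n in range(7)]
print(delays)
```

The delay doubles on each attempt until it reaches the configured maximum, after which it stays flat.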
*** Sample configurations
Here is a sample minimal setup for RM failover.
+---+
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>cluster1</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>master1</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>master2</value>
</property>
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
+---+
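The relationship between yarn.resourcemanager.ha.rm-ids and the per-RM hostname properties can be checked mechanically. The Python sketch below parses a trimmed copy of the sample above, wrapped in the usual configuration root element, and maps each RM id to its hostname; rm_hostnames is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample configuration above, wrapped in a <configuration> root.
SAMPLE_CONFIG = """<configuration>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>master1</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>master2</value>
  </property>
</configuration>"""


def rm_hostnames(config_xml):
    """Map each logical RM id to its configured hostname (None if missing)."""
    root = ET.fromstring(config_xml)
    props = {p.findtext("name"): p.findtext("value") for p in root.iter("property")}
    rm_ids = [i.strip() for i in props["yarn.resourcemanager.ha.rm-ids"].split(",")]
    return {i: props.get("yarn.resourcemanager.hostname." + i) for i in rm_ids}


hosts = rm_hostnames(SAMPLE_CONFIG)
print(hosts)
```

A None value in the result flags an RM id that is listed in rm-ids but has no matching hostname property.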
** Admin commands
<<<yarn rmadmin>>> has a few HA-specific command options to check the health/state of an
RM, and transition to Active/Standby.
HA commands take the service id of an RM, as set by <<<yarn.resourcemanager.ha.rm-ids>>>,
as an argument.
+---+
$ yarn rmadmin -getServiceState rm1
active
$ yarn rmadmin -getServiceState rm2
standby
+---+
If automatic failover is enabled, you cannot use the manual transition commands.
Although you can override this with the --forcemanual flag, do so with caution.
+---+
$ yarn rmadmin -transitionToStandby rm1
Automatic failover is enabled for org.apache.hadoop.yarn.client.RMHAServiceTarget@1d8299fd
Refusing to manually manage HA state, since it may cause
a split-brain scenario or other incorrect state.
If you are very sure you know what you are doing, please
specify the forcemanual flag.
+---+
See {{{./YarnCommands.html}YarnCommands}} for more details.
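For scripting, the state query shown above can be wrapped in a small function. The Python sketch below is an illustration only: get_service_state is a hypothetical helper, it assumes the yarn command is on the PATH, and the run parameter exists purely so the function can be exercised without a cluster.

```python
import subprocess


def get_service_state(rm_id, run=subprocess.run):
    """Return the HA state printed by 'yarn rmadmin -getServiceState <rm_id>'."""
    result = run(
        ["yarn", "rmadmin", "-getServiceState", rm_id],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode output as str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.strip()


# Example (requires a configured HA cluster):
#   get_service_state("rm1")
```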
** ResourceManager Web UI services
Assuming a standby RM is up and running, the Standby automatically redirects
all web requests to the Active, except for the "About" page.
** Web Services
Assuming a standby RM is up and running, RM web-services described at
{{{./ResourceManagerRest.html}ResourceManager REST APIs}} when invoked on
a standby RM are automatically redirected to the Active RM.


@ -1,298 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
ResourceManager Restart
---
---
${maven.build.timestamp}
ResourceManager Restart
%{toc|section=1|fromDepth=0}
* {Overview}
ResourceManager is the central authority that manages resources and schedules
applications running on top of YARN. Hence, it is potentially a single point of
failure in an Apache YARN cluster.
This document gives an overview of ResourceManager Restart, a feature that
enhances ResourceManager to keep functioning across restarts and also makes
ResourceManager down-time invisible to end-users.
ResourceManager Restart feature is divided into two phases:
ResourceManager Restart Phase 1 (Non-work-preserving RM restart):
Enhance RM to persist application/attempt state
and other credentials information in a pluggable state-store. RM will reload
this information from state-store upon restart and re-kick the previously
running applications. Users are not required to re-submit the applications.
ResourceManager Restart Phase 2 (Work-preserving RM restart):
Focus on re-constructing the running state of ResourceManager by combining
the container statuses from NodeManagers and container requests from ApplicationMasters
upon restart. The key difference from phase 1 is that previously running applications
will not be killed after RM restarts, and so applications won't lose their work
because of an RM outage.
* {Feature}
** Phase 1: Non-work-preserving RM restart
As of the Hadoop 2.4.0 release, only ResourceManager Restart Phase 1 is implemented; it
is described below.
The overall concept is that RM persists the application metadata
(i.e. ApplicationSubmissionContext) in
a pluggable state-store when a client submits an application, and also saves the final status
of the application, such as the completion state (failed, killed, finished)
and diagnostics, when the application completes. In addition, RM saves
credentials such as security keys and tokens so that it can work in a secure environment.
Any time RM shuts down, as long as the required information (i.e. application metadata
and the accompanying credentials if running in a secure environment) is available
in the state-store, when RM restarts, it can pick up the application metadata
from the state-store and re-submit the application. RM won't re-submit the
applications if they were already completed (i.e. failed, killed, finished)
before RM went down.
NodeManagers and clients during the down-time of RM will keep polling RM until
RM comes up. When RM becomes alive, it will send a re-sync command to
all the NodeManagers and ApplicationMasters it was talking to via heartbeats.
As of the Hadoop 2.4.0 release, NodeManagers and ApplicationMasters handle this command
as follows: NMs kill all their managed containers and re-register with RM. From the
RM's perspective, these re-registered NodeManagers are similar to newly joining NMs.
AMs (e.g. the MapReduce AM) are expected to shut down when they receive the re-sync command.
After RM restarts, loads all the application metadata and credentials from the state-store,
and populates them into memory, it will create a new
attempt (i.e. ApplicationMaster) for each application that was not yet completed
and re-kick that application as usual. As described before, the previously running
applications' work is lost in this manner since they are essentially killed by
RM via the re-sync command on restart.
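The recovery rule of Phase 1 can be modelled compactly. The Python sketch below is a toy model, not RM code: the state-store is represented as a plain dict, the record layout and the name apps_to_resubmit are hypothetical, and the completion states are the three named in the text above.

```python
# Completion states named in the text above; completed apps are not re-submitted.
COMPLETED_STATES = {"FAILED", "KILLED", "FINISHED"}


def apps_to_resubmit(state_store):
    """Return ids of applications that need a new attempt after an RM restart.

    state_store maps application id -> record with a 'final_state' entry,
    which is None while the application has not completed.
    """
    return sorted(
        app_id
        for app_id, record in state_store.items()
        if record.get("final_state") not in COMPLETED_STATES
    )


store = {
    "application_1326121700862_0002": {"final_state": "FINISHED"},
    "application_1326121700862_0003": {"final_state": None},
    "application_1326121700862_0005": {"final_state": None},
}
print(apps_to_resubmit(store))
```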
** Phase 2: Work-preserving RM restart
As of Hadoop 2.6.0, the RM restart feature was further enhanced so that applications
running on the YARN cluster are not killed when RM restarts.
Beyond the groundwork done in Phase 1 to persist application state
and reload that state on recovery, Phase 2 primarily focuses
on re-constructing the entire running state of the YARN cluster, the majority of which is
the state of the central scheduler inside RM, which keeps track of all containers' life-cycle,
applications' headroom and resource requests, queues' resource usage, etc. In this way,
RM doesn't need to kill the AM and re-run the application from scratch as is
done in Phase 1. Applications can simply re-sync with RM and
resume from where they left off.
RM recovers its running state by taking advantage of the container statuses sent from all NMs.
An NM does not kill its containers when it re-syncs with the restarted RM. It continues
managing the containers and sends the container statuses to RM when it re-registers.
RM reconstructs the container instances and the associated applications' scheduling status by
absorbing these containers' information. In the meantime, the AM needs to re-send its
outstanding resource requests to RM because RM may lose the unfulfilled requests when it shuts down.
Application writers using the AMRMClient library to communicate with RM do not need to
worry about the AM re-sending resource requests to RM on re-sync, as this is
automatically taken care of by the library itself.
* {Configurations}
** Enable RM Restart.
*--------------------------------------+--------------------------------------+
|| Property || Value |
*--------------------------------------+--------------------------------------+
| <<<yarn.resourcemanager.recovery.enabled>>> | |
| | <<<true>>> |
*--------------------------------------+--------------------------------------+
** Configure the state-store for persisting the RM state.
*--------------------------------------*--------------------------------------+
|| Property || Description |
*--------------------------------------+--------------------------------------+
| <<<yarn.resourcemanager.store.class>>> | |
| | The class name of the state-store to be used for saving application/attempt |
| | state and the credentials. The available state-store implementations are |
| | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore>>>, |
| | a ZooKeeper based state-store implementation; |
| | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore>>>, |
| | a Hadoop FileSystem based state-store implementation (e.g. HDFS or local FS); and |
| | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore>>>, |
| | a LevelDB based state-store implementation. |
| | The default value is set to |
| | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore>>>. |
*--------------------------------------+--------------------------------------+
** How to choose the state-store implementation.
<<ZooKeeper based state-store>>: Users are free to pick any storage to set up RM restart,
but must use the ZooKeeper based state-store to support RM HA. The reason is that only the ZooKeeper
based state-store supports a fencing mechanism to avoid a split-brain situation where multiple
RMs assume they are active and can edit the state-store at the same time.
<<FileSystem based state-store>>: HDFS and local FS based state-stores are supported.
The fencing mechanism is not supported.
<<LevelDB based state-store>>: The LevelDB based state-store is considered more lightweight than the HDFS and ZooKeeper
based state-stores. LevelDB supports better atomic operations, fewer I/O ops per state update,
and far fewer total files on the filesystem. The fencing mechanism is not supported.
** Configurations for Hadoop FileSystem based state-store implementation.
Both HDFS and local FS based state-store implementations are supported. The type of file system to
be used is determined by the scheme of the URI, e.g. <<<hdfs://localhost:9000/rmstore>>> uses HDFS as the storage and
<<<file:///tmp/yarn/rmstore>>> uses local FS as the storage. If no
scheme (<<<hdfs://>>> or <<<file://>>>) is specified in the URI, the type of storage to be used is
determined by <<<fs.defaultFS>>> defined in <<<core-site.xml>>>.
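The scheme-based selection above can be illustrated with plain <<<java.net.URI>>>. This is a minimal sketch, not the state-store's actual resolution code; the class and method names are hypothetical, and the second argument stands in for the value of <<<fs.defaultFS>>>.

```java
import java.net.URI;

// Hypothetical sketch: pick the storage type from the URI scheme,
// falling back to the default file system when no scheme is given.
final class StateStoreUriSketch {
    static String storageScheme(String stateStoreUri, String defaultFs) {
        String scheme = URI.create(stateStoreUri).getScheme();
        if (scheme != null) {
            return scheme; // e.g. "hdfs" or "file"
        }
        // No scheme in the configured URI: use the scheme of fs.defaultFS instead.
        return URI.create(defaultFs).getScheme();
    }
}
```

For example, <<<storageScheme("/rmstore", "hdfs://localhost:9000")>>> resolves to the HDFS scheme because the configured path carries no scheme of its own.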
Configure the URI where the RM state will be saved in the Hadoop FileSystem state-store.
*--------------------------------------+--------------------------------------+
|| Property || Description |
*--------------------------------------+--------------------------------------+
| <<<yarn.resourcemanager.fs.state-store.uri>>> | |
| | URI pointing to the location of the FileSystem path where RM state will |
| | be stored (e.g. hdfs://localhost:9000/rmstore). |
| | Default value is <<<${hadoop.tmp.dir}/yarn/system/rmstore>>>. |
| | If FileSystem name is not provided, <<<fs.defaultFS>>> specified in |
| | <<conf/core-site.xml>> will be used. |
*--------------------------------------+--------------------------------------+
Configure the retry policy that the state-store client uses to connect with the Hadoop
FileSystem.
*--------------------------------------+--------------------------------------+
|| Property || Description |
*--------------------------------------+--------------------------------------+
| <<<yarn.resourcemanager.fs.state-store.retry-policy-spec>>> | |
| | Hadoop FileSystem client retry policy specification. Hadoop FileSystem client retry |
| | is always enabled. Specified in pairs of sleep-time and number-of-retries |
| | i.e. (t0, n0), (t1, n1), ..., the first n0 retries sleep t0 milliseconds on |
| | average, the following n1 retries sleep t1 milliseconds on average, and so on. |
| | Default value is (2000, 500) |
*--------------------------------------+--------------------------------------+
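The pair semantics of the retry policy specification can be sketched as below. This is an illustration only: the class and method names are hypothetical, the pairs are passed as an already parsed array, and the sketch returns the nominal sleep time, whereas the description above speaks of sleep times "on average".

```java
// Hypothetical sketch of how (sleep-time, number-of-retries) pairs map a
// retry attempt to a sleep time. Not the Hadoop FileSystem client code.
final class RetryPolicySketch {
    // pairs[i] = { sleepMillis, numberOfRetries }; retryIndex is 0-based.
    // Returns the nominal sleep in milliseconds, or -1 once all retries are used up.
    static long sleepMillisFor(int retryIndex, long[][] pairs) {
        long remaining = retryIndex;
        for (long[] pair : pairs) {
            long sleepMillis = pair[0];
            long retries = pair[1];
            if (remaining < retries) {
                return sleepMillis;
            }
            remaining -= retries; // move on to the next (t, n) pair
        }
        return -1;
    }
}
```

With the default specification of (2000, 500), every one of the first 500 retries maps to a nominal sleep of 2000 milliseconds, and no further retries remain after that.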
** Configurations for ZooKeeper based state-store implementation.
Configure the ZooKeeper server address and the root path where the RM state is stored.
*--------------------------------------+--------------------------------------+
|| Property || Description |
*--------------------------------------+--------------------------------------+
| <<<yarn.resourcemanager.zk-address>>> | |
| | Comma separated list of Host:Port pairs. Each corresponds to a ZooKeeper server |
| | (e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002") to be used by the RM |
| | for storing RM state. |
*--------------------------------------+--------------------------------------+
| <<<yarn.resourcemanager.zk-state-store.parent-path>>> | |
| | The full path of the root znode where RM state will be stored. |
| | Default value is /rmstore. |
*--------------------------------------+--------------------------------------+
Configure the retry policy that the state-store client uses to connect with the ZooKeeper server.
*--------------------------------------+--------------------------------------+
|| Property || Description |
*--------------------------------------+--------------------------------------+
| <<<yarn.resourcemanager.zk-num-retries>>> | |
| | Number of times RM tries to connect to ZooKeeper server if the connection is lost. |
| | Default value is 500. |
*--------------------------------------+--------------------------------------+
| <<<yarn.resourcemanager.zk-retry-interval-ms>>> | |
| | The interval in milliseconds between retries when connecting to a ZooKeeper server. |
| | Default value is 2 seconds. |
*--------------------------------------+--------------------------------------+
| <<<yarn.resourcemanager.zk-timeout-ms>>> | |
| | ZooKeeper session timeout in milliseconds. This configuration is used by |
| | the ZooKeeper server to determine when the session expires. Session expiration |
| | happens when the server does not hear from the client (i.e. no heartbeat) within the session |
| | timeout period specified by this configuration. Default |
| | value is 10 seconds |
*--------------------------------------+--------------------------------------+
Configure the ACLs to be used for setting permissions on ZooKeeper znodes.
*--------------------------------------+--------------------------------------+
|| Property || Description |
*--------------------------------------+--------------------------------------+
| <<<yarn.resourcemanager.zk-acl>>> | |
| | ACLs to be used for setting permissions on ZooKeeper znodes. Default value is <<<world:anyone:rwcda>>> |
*--------------------------------------+--------------------------------------+
** Configurations for LevelDB based state-store implementation.
*--------------------------------------+--------------------------------------+
|| Property || Description |
*--------------------------------------+--------------------------------------+
| <<<yarn.resourcemanager.leveldb-state-store.path>>> | |
| | Local path where the RM state will be stored. |
| | Default value is <<<${hadoop.tmp.dir}/yarn/system/rmstore>>> |
*--------------------------------------+--------------------------------------+
** Configurations for work-preserving RM recovery.
*--------------------------------------+--------------------------------------+
|| Property || Description |
*--------------------------------------+--------------------------------------+
| <<<yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms>>> | |
| | Set the amount of time RM waits before allocating new |
| | containers on RM work-preserving recovery. This wait period gives RM a chance |
| | to finish resyncing with the NMs in the cluster on recovery, before assigning |
| | new containers to applications. |
*--------------------------------------+--------------------------------------+
* {Notes}
The ContainerId string format changes if RM restarts with work-preserving recovery enabled.
It used to have the following format:
Container_\{clusterTimestamp\}_\{appId\}_\{attemptId\}_\{containerId\}, e.g. Container_1410901177871_0001_01_000005.
It is now changed to:
Container_<<e\{epoch\}>>_\{clusterTimestamp\}_\{appId\}_\{attemptId\}_\{containerId\}, e.g. Container_<<e17>>_1410901177871_0001_01_000005.
Here, the additional epoch number is a
monotonically increasing integer which starts from 0 and is increased by 1 each time
RM restarts. If epoch number is 0, it is omitted and the containerId string format
stays the same as before.
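The format rule above can be illustrated with a small formatter. This is a sketch under stated assumptions, not the YARN ContainerId implementation: the class name is hypothetical, the prefix is copied from the format strings above, and the zero-padding widths (4, 2 and 6 digits) are inferred from the example IDs.

```java
import java.util.Locale;

// Hypothetical sketch of the ContainerId string layout described above.
final class ContainerIdFormatSketch {
    static String format(long epoch, long clusterTimestamp, int appId,
                         int attemptId, long containerId) {
        StringBuilder sb = new StringBuilder("Container_");
        // The epoch part is omitted when the epoch is 0, so the old format is kept.
        if (epoch > 0) {
            sb.append('e').append(epoch).append('_');
        }
        sb.append(clusterTimestamp).append('_')
          .append(String.format(Locale.ROOT, "%04d", appId)).append('_')
          .append(String.format(Locale.ROOT, "%02d", attemptId)).append('_')
          .append(String.format(Locale.ROOT, "%06d", containerId));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(format(0, 1410901177871L, 1, 1, 5));
        System.out.println(format(17, 1410901177871L, 1, 1, 5));
    }
}
```

Tools that parse container IDs with a fixed number of underscore-separated fields need to account for the optional epoch field once RM has restarted at least once.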
* {Sample configurations}
Below is a minimum set of configurations for enabling RM work-preserving restart using ZooKeeper based state store.
+---+
<property>
<description>Enable RM to recover state after starting. If true, then
yarn.resourcemanager.store.class must be specified</description>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<description>The class to use as the persistent store.</description>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
<description>Comma separated list of Host:Port pairs. Each corresponds to a ZooKeeper server
(e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002") to be used by the RM for storing RM state.
This must be supplied when using org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
as the value for yarn.resourcemanager.store.class</description>
<name>yarn.resourcemanager.zk-address</name>
<value>127.0.0.1:2181</value>
</property>
+---+


@ -1,176 +0,0 @@
---
YARN Secure Containers
---
---
${maven.build.timestamp}
YARN Secure Containers
%{toc|section=1|fromDepth=0|toDepth=3}
* {Overview}
YARN containers in a secure cluster use the operating system facilities to offer
execution isolation for containers. Secure containers execute under the credentials
of the job user. The operating system enforces access restriction for the container.
The container must run as the user that submitted the application.
Secure Containers work only in the context of secured YARN clusters.
** Container isolation requirements
The container executor must access the local files and directories needed by the
container such as jars, configuration files, log files, shared objects etc. Although
it is launched by the NodeManager, the container should not have access to the
NodeManager private files and configuration. Containers running applications
submitted by different users should be isolated and unable to access each other's
files and directories. Similar requirements apply to other system non-file securable
objects like named pipes, critical sections, LPC queues, shared memory etc.
** Linux Secure Container Executor
In a Linux environment the secure container executor is the <<<LinuxContainerExecutor>>>.
It uses an external program called the <<<container-executor>>> to launch the container.
This program has the <<<setuid>>> access right flag set which allows it to launch
the container with the permissions of the YARN application user.
*** Configuration
The configured directories for <<<yarn.nodemanager.local-dirs>>> and
<<<yarn.nodemanager.log-dirs>>> must be owned by the configured NodeManager user
(<<<yarn>>>) and group (<<<hadoop>>>). The permission set on these directories must
be <<<drwxr-xr-x>>>.
The <<<container-executor>>> program must be owned by <<<root>>> and have the
permission set <<<---sr-s--->>>.
To configure the <<<NodeManager>>> to use the <<<LinuxContainerExecutor>>> set the following
in the <<conf/yarn-site.xml>>:
+---+
<property>
<name>yarn.nodemanager.container-executor.class</name>
<value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
<name>yarn.nodemanager.linux-container-executor.group</name>
<value>hadoop</value>
</property>
+---+
Additionally the LCE requires the <<<container-executor.cfg>>> file, which is read by the
<<<container-executor>>> program.
+---+
yarn.nodemanager.linux-container-executor.group=#configured value of yarn.nodemanager.linux-container-executor.group
banned.users=#comma separated list of users who can not run applications
allowed.system.users=#comma separated list of allowed system users
min.user.id=1000#Prevent other super-users
+---+
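One plausible reading of how these settings interact is sketched below. The real check lives in the native <<<container-executor>>> program; the class and method names here are hypothetical, and the evaluation order is an assumption based only on the comments in the file above.

```java
import java.util.Set;

// Hypothetical sketch of the user checks implied by container-executor.cfg.
// Assumption: banned users are always rejected, and users below min.user.id
// are rejected unless listed in allowed.system.users.
final class LceUserCheckSketch {
    static boolean mayRunContainers(String user, int uid,
                                    Set<String> bannedUsers,
                                    Set<String> allowedSystemUsers,
                                    int minUserId) {
        if (bannedUsers.contains(user)) {
            return false; // banned.users
        }
        if (uid < minUserId && !allowedSystemUsers.contains(user)) {
            return false; // min.user.id, unless explicitly allowed
        }
        return true;
    }
}
```

Under this reading, <<<min.user.id=1000>>> keeps low-numbered system accounts from launching containers unless an operator explicitly allows them.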
** Windows Secure Container Executor (WSCE)
The Windows environment secure container executor is the <<<WindowsSecureContainerExecutor>>>.
It uses the Windows S4U infrastructure to launch the container as the
YARN application user. The WSCE requires the presence of the <<<hadoopwinutilsvc>>> service. This service
is hosted by <<<%HADOOP_HOME%\bin\winutils.exe>>> started with the <<<service>>> command line argument. This
service offers some privileged operations that require LocalSystem authority so that the NM is not required
to run the entire JVM and all the NM code in an elevated context. The NM interacts with the <<<hadoopwinutilsvc>>>
service by means of Local RPC (LRPC) via JNI calls to the RPC client hosted in <<<hadoop.dll>>>.
*** Configuration
To configure the <<<NodeManager>>> to use the <<<WindowsSecureContainerExecutor>>>
set the following in the <<conf/yarn-site.xml>>:
+---+
<property>
<name>yarn.nodemanager.container-executor.class</name>
<value>org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor</value>
</property>
<property>
<name>yarn.nodemanager.windows-secure-container-executor.group</name>
<value>yarn</value>
</property>
+---+
*** wsce-site.xml
The hadoopwinutilsvc uses <<<%HADOOP_HOME%\etc\hadoop\wsce-site.xml>>> to configure access to the privileged operations.
+---+
<property>
<name>yarn.nodemanager.windows-secure-container-executor.impersonate.allowed</name>
<value>HadoopUsers</value>
</property>
<property>
<name>yarn.nodemanager.windows-secure-container-executor.impersonate.denied</name>
<value>HadoopServices,Administrators</value>
</property>
<property>
<name>yarn.nodemanager.windows-secure-container-executor.allowed</name>
<value>nodemanager</value>
</property>
<property>
<name>yarn.nodemanager.windows-secure-container-executor.local-dirs</name>
<value>nm-local-dir, nm-log-dirs</value>
</property>
<property>
<name>yarn.nodemanager.windows-secure-container-executor.job-name</name>
<value>nodemanager-job-name</value>
</property>
+---+
<<<yarn.nodemanager.windows-secure-container-executor.allowed>>> should contain the name of the service account running the
nodemanager. This user will be allowed to access the hadoopwinutilsvc functions.
<<<yarn.nodemanager.windows-secure-container-executor.impersonate.allowed>>> should contain the users that are allowed to create
containers in the cluster. These users will be allowed to be impersonated by hadoopwinutilsvc.
<<<yarn.nodemanager.windows-secure-container-executor.impersonate.denied>>> should contain users that are explicitly forbidden from
creating containers. hadoopwinutilsvc will refuse to impersonate these users.
<<<yarn.nodemanager.windows-secure-container-executor.local-dirs>>> should contain the nodemanager local dirs. hadoopwinutilsvc will
allow file operations only under these directories. This should contain the same values as <<<${yarn.nodemanager.local-dirs}, ${yarn.nodemanager.log-dirs}>>>,
but note that hadoopwinutilsvc XML configuration processing does not do substitutions, so the value must be the final value. All paths
must be absolute and no environment variable substitution will be performed. The paths are compared using a LOCALE_INVARIANT case-insensitive string comparison;
the file path being validated must start with one of the paths listed in the local-dirs configuration. Use a comma as the path separator: <<<,>>>
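The "must start with one of the listed paths" rule can be sketched as follows. This is an illustration in Java rather than the native hadoopwinutilsvc code: the names are hypothetical, <<<Locale.ROOT>>> lower-casing stands in for the invariant case-insensitive comparison, and trimming whitespace around each entry is an assumption.

```java
import java.util.Locale;

// Hypothetical sketch of the local-dirs prefix check described above.
final class LocalDirsCheckSketch {
    static boolean isUnderLocalDirs(String path, String localDirsValue) {
        String candidate = path.toLowerCase(Locale.ROOT);
        // The configured value is a comma-separated list of absolute paths.
        for (String dir : localDirsValue.split(",")) {
            String prefix = dir.trim().toLowerCase(Locale.ROOT);
            if (!prefix.isEmpty() && candidate.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

Because the rule is a plain prefix match, the configured entries should be the final absolute directories; no variable or environment substitution is applied to them.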
<<<yarn.nodemanager.windows-secure-container-executor.job-name>>> should contain a Windows NT job name that all containers should be added to.
This configuration is optional. If not set, the container is not added to a global NodeManager job. Normally this should be set to the job that the NM is assigned to,
so that killing the NM also kills all containers. Hadoopwinutilsvc will not attempt to create this job; the job must exist when the container is launched.
If the value is set and the job does not exist, container launch will fail with error 2 <<<The system cannot find the file specified>>>.
Note that this global NM job is not related to the container job, which always gets created for each container and is named after the container ID.
This setting controls a global job that spans all containers and the parent NM, and as such it requires nested jobs.
Nested jobs are available only on Windows 8 and Windows Server 2012 or later.
*** Useful Links
* {{{http://msdn.microsoft.com/en-us/magazine/cc188757.aspx}Exploring S4U Kerberos Extensions in Windows Server 2003}}
* {{{http://msdn.microsoft.com/en-us/library/windows/desktop/hh448388(v=vs.85).aspx}Nested Jobs}}
* {{{https://issues.apache.org/jira/browse/YARN-1063}Winutils needs ability to create task as domain user}}
* {{{https://issues.apache.org/jira/browse/YARN-1972}Implement secure Windows Container Executor}}
* {{{https://issues.apache.org/jira/browse/YARN-2198}Remove the need to run NodeManager as privileged account for Windows Secure Container Executor}}


@ -1,260 +0,0 @@
---
YARN Timeline Server
---
---
${maven.build.timestamp}
YARN Timeline Server
%{toc|section=1|fromDepth=0|toDepth=3}
* Overview
Storage and retrieval of applications' current as well as historic
information in a generic fashion is solved in YARN through the Timeline
Server (previously also called Generic Application History Server). This
serves two responsibilities:
** Generic information about completed applications
Generic information includes application level data like queue-name, user
information etc in the ApplicationSubmissionContext, list of
application-attempts that ran for an application, information about each
application-attempt, list of containers run under each application-attempt,
and information about each container. Generic data is stored by
ResourceManager to a history-store (default implementation on a file-system)
and used by the web-UI to display information about completed applications.
** Per-framework information of running and completed applications
Per-framework information is completely specific to an application or
framework. For example, Hadoop MapReduce framework can include pieces of
information like number of map tasks, reduce tasks, counters etc.
Application developers can publish the specific information to the Timeline
server via TimelineClient from within a client, the ApplicationMaster
and/or the application's containers. This information is then queryable via
REST APIs for rendering by application/framework specific UIs.
* Current Status
Timeline server is a work in progress. The basic storage and retrieval of
information, both generic and framework specific, are in place. Timeline
server doesn't work in secure mode yet. The generic information and the
per-framework information are today collected and presented separately and
thus are not integrated well together. Finally, the per-framework information
is only available via RESTful APIs, using JSON type content - ability to
install framework specific UIs in YARN isn't supported yet.
* Basic Configuration
Users need to configure the Timeline server before starting it. The simplest
configuration you should add in <<<yarn-site.xml>>> is to set the hostname of
the Timeline server:
+---+
<property>
<description>The hostname of the Timeline service web application.</description>
<name>yarn.timeline-service.hostname</name>
<value>0.0.0.0</value>
</property>
+---+
* Advanced Configuration
In addition to the hostname, admins can also configure whether the service is
enabled or not, the ports of the RPC and the web interfaces, and the number
of RPC handler threads.
+---+
<property>
<description>Address for the Timeline server to start the RPC server.</description>
<name>yarn.timeline-service.address</name>
<value>${yarn.timeline-service.hostname}:10200</value>
</property>
<property>
<description>The http address of the Timeline service web application.</description>
<name>yarn.timeline-service.webapp.address</name>
<value>${yarn.timeline-service.hostname}:8188</value>
</property>
<property>
<description>The https address of the Timeline service web application.</description>
<name>yarn.timeline-service.webapp.https.address</name>
<value>${yarn.timeline-service.hostname}:8190</value>
</property>
<property>
<description>Handler thread count to serve the client RPC requests.</description>
<name>yarn.timeline-service.handler-thread-count</name>
<value>10</value>
</property>
<property>
<description>Enables cross-origin support (CORS) for web services where
cross-origin web response headers are needed. For example, javascript making
a web services request to the timeline server.</description>
<name>yarn.timeline-service.http-cross-origin.enabled</name>
<value>false</value>
</property>
<property>
<description>Comma separated list of origins that are allowed for web
services needing cross-origin (CORS) support. Wildcards (*) and patterns
allowed</description>
<name>yarn.timeline-service.http-cross-origin.allowed-origins</name>
<value>*</value>
</property>
<property>
<description>Comma separated list of methods that are allowed for web
services needing cross-origin (CORS) support.</description>
<name>yarn.timeline-service.http-cross-origin.allowed-methods</name>
<value>GET,POST,HEAD</value>
</property>
<property>
<description>Comma separated list of headers that are allowed for web
services needing cross-origin (CORS) support.</description>
<name>yarn.timeline-service.http-cross-origin.allowed-headers</name>
<value>X-Requested-With,Content-Type,Accept,Origin</value>
</property>
<property>
<description>The number of seconds a pre-flighted request can be cached
for web services needing cross-origin (CORS) support.</description>
<name>yarn.timeline-service.http-cross-origin.max-age</name>
<value>1800</value>
</property>
+---+
* Generic-data related Configuration
Users can specify whether the generic data collection is enabled or not, and
also choose the storage-implementation class for the generic data. There are
more configurations related to generic data collection, and users can refer
to <<<yarn-default.xml>>> for all of them.
+---+
<property>
<description>Indicate to ResourceManager as well as clients whether
history-service is enabled or not. If enabled, ResourceManager starts
recording historical data that Timeline service can consume. Similarly,
clients can redirect to the history service when applications
finish if this is enabled.</description>
<name>yarn.timeline-service.generic-application-history.enabled</name>
<value>false</value>
</property>
<property>
<description>Store class name for history store, defaulting to file system
store</description>
<name>yarn.timeline-service.generic-application-history.store-class</name>
<value>org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore</value>
</property>
+---+
* Per-framework-data related Configuration
Users can specify whether per-framework data service is enabled or not,
choose the store implementation for the per-framework data, and tune the
retention of the per-framework data. There are more configurations related to
per-framework data service, and users can refer to <<<yarn-default.xml>>> for
all of them.
+---+
<property>
<description>Indicate to clients whether Timeline service is enabled or not.
If enabled, the TimelineClient library used by end-users will post entities
and events to the Timeline server.</description>
<name>yarn.timeline-service.enabled</name>
<value>true</value>
</property>
<property>
<description>Store class name for timeline store.</description>
<name>yarn.timeline-service.store-class</name>
<value>org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore</value>
</property>
<property>
<description>Enable age off of timeline store data.</description>
<name>yarn.timeline-service.ttl-enable</name>
<value>true</value>
</property>
<property>
<description>Time to live for timeline store data in milliseconds.</description>
<name>yarn.timeline-service.ttl-ms</name>
<value>604800000</value>
</property>
+---+
* Running Timeline server
Assuming all the aforementioned configurations are set properly, admins can
start the Timeline server/history service with the following command:
+---+
$ yarn timelineserver
+---+
Or users can start the Timeline server / history service as a daemon:
+---+
$ yarn --daemon start timelineserver
+---+
* Accessing generic-data via command-line
Users can access applications' generic historic data via the command line as
below. Note that the same commands are usable to obtain the corresponding
information about running applications.
+---+
$ yarn application -status <Application ID>
$ yarn applicationattempt -list <Application ID>
$ yarn applicationattempt -status <Application Attempt ID>
$ yarn container -list <Application Attempt ID>
$ yarn container -status <Container ID>
+---+
* Publishing of per-framework data by applications
Developers can define what information they want to record for their
applications by composing <<<TimelineEntity>>> and <<<TimelineEvent>>>
objects, and put the entities and events to the Timeline server via
<<<TimelineClient>>>. Below is an example:
+---+
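import java.io.IOException;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.api.records.timeline.TimelinePutResponse;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.exceptions.YarnException;

// Create and start the Timeline client ('conf' is an existing Configuration)
TimelineClient client = TimelineClient.createTimelineClient();
client.init(conf);
client.start();

// Compose the entity; the type and ID below are illustrative placeholders
TimelineEntity entity = new TimelineEntity();
entity.setEntityType("MY_APPLICATION");
entity.setEntityId("entity_1");
entity.setStartTime(System.currentTimeMillis());

try {
  TimelinePutResponse response = client.putEntities(entity);
  // response.getErrors() lists entities the server could not store
} catch (IOException e) {
  // Handle the exception
} catch (YarnException e) {
  // Handle the exception
} finally {
  // Stop the Timeline client
  client.stop();
}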
+---+


@ -1,49 +0,0 @@
---
YARN
---
---
${maven.build.timestamp}
Web Application Proxy
The Web Application Proxy is part of YARN. By default it runs as part of
the Resource Manager (RM), but it can be configured to run in stand-alone mode.
The reason for the proxy is to reduce the possibility of web based attacks
through YARN.
In YARN the Application Master (AM) has the responsibility to provide a web UI
and to send that link to the RM. This opens up a number of potential
issues. The RM runs as a trusted user, and people visiting that web
address will treat it, and the links it provides to them, as trusted, when in
reality the AM is running as a non-trusted user, and the links it gives to
the RM could point to anything, malicious or otherwise. The Web Application
Proxy mitigates this risk by warning users that do not own the given
application that they are connecting to an untrusted site.
In addition to this, the proxy also tries to reduce the impact that a malicious
AM could have on a user. It primarily does this by stripping out cookies from
the user's request and replacing them with a single cookie providing the user name of
the logged in user. This is because most web based authentication systems
identify a user based on a cookie. Providing such a cookie to an
untrusted application opens up the potential for an exploit. If the cookie
is designed properly that potential should be fairly minimal, but stripping it
reduces that potential attack vector. The current proxy implementation does
nothing to prevent the AM from providing links to malicious external sites,
nor does it do anything to prevent malicious javascript code from running.
In fact javascript can be used to get the cookies, so stripping the
cookies from the request has minimal benefit at this time.
In the future we hope to address the attack vectors described above and make
attaching to an AM's web UI safer.


@ -1,757 +0,0 @@
---
Hadoop Map Reduce Next Generation-${project.version} - Writing YARN
Applications
---
---
${maven.build.timestamp}
Hadoop MapReduce Next Generation - Writing YARN Applications
%{toc|section=1|fromDepth=0}
* Purpose
This document describes, at a high-level, the way to implement new
Applications for YARN.
* Concepts and Flow
The general concept is that an <application submission client> submits an
<application> to the YARN <ResourceManager> (RM). This can be done through
setting up a <<<YarnClient>>> object. After <<<YarnClient>>> is started, the
client can then set up application context, prepare the very first container of
the application that contains the <ApplicationMaster> (AM), and then submit
the application. You need to provide information such as the details about the
local files/jars that need to be available for your application to run, the
actual command that needs to be executed (with the necessary command line
arguments), any OS environment settings (optional), etc. Effectively, you
need to describe the Unix process(es) that needs to be launched for your
ApplicationMaster.
The YARN ResourceManager will then launch the ApplicationMaster (as
specified) on an allocated container. The ApplicationMaster communicates with
the YARN cluster and handles application execution. It performs operations in an
asynchronous fashion. During application launch time, the main tasks of the
ApplicationMaster are: a) communicating with the ResourceManager to negotiate
and allocate resources for future containers, and b) after container
allocation, communicating with YARN <NodeManager>s (NMs) to launch application
containers on them. Task a) can be performed asynchronously through an
<<<AMRMClientAsync>>> object, with event handling methods specified in a
<<<AMRMClientAsync.CallbackHandler>>> type of event handler. The event handler
needs to be set to the client explicitly. Task b) can be performed by launching
a runnable object that then launches containers when there are containers
allocated. As part of launching this container, the AM has to
specify the <<<ContainerLaunchContext>>> that has the launch information such as
command line specification, environment, etc.
During the execution of an application, the ApplicationMaster communicates
with NodeManagers through an <<<NMClientAsync>>> object. All container events
are handled by the <<<NMClientAsync.CallbackHandler>>> associated with that
<<<NMClientAsync>>>. A typical callback handler handles container start, stop,
status update and error. The ApplicationMaster also reports execution progress
to the ResourceManager by implementing the <<<getProgress()>>> method of
<<<AMRMClientAsync.CallbackHandler>>>.
Besides the asynchronous clients, there are synchronous versions for certain
workflows (<<<AMRMClient>>> and <<<NMClient>>>). The asynchronous clients are
recommended because they are (subjectively) simpler to use, and this article
mainly covers the asynchronous clients. Please refer to <<<AMRMClient>>>
and <<<NMClient>>> for more information on synchronous clients.
* Interfaces
The interfaces you'd most likely be concerned with are:
* <<Client>>\<--\><<ResourceManager>>\
By using <<<YarnClient>>> objects.
* <<ApplicationMaster>>\<--\><<ResourceManager>>\
By using <<<AMRMClientAsync>>> objects, handling events asynchronously by
<<<AMRMClientAsync.CallbackHandler>>>
* <<ApplicationMaster>>\<--\><<NodeManager>>\
Launch containers. Communicate with NodeManagers
by using <<<NMClientAsync>>> objects, handling container events by
<<<NMClientAsync.CallbackHandler>>>
[]
<<Note>>
* The three main protocols for YARN applications (ApplicationClientProtocol,
ApplicationMasterProtocol and ContainerManagementProtocol) are still
preserved. The three clients wrap these three protocols to provide a simpler
programming model for YARN applications.
* Under very rare circumstances, a programmer may want to use the three
protocols directly to implement an application. However, note that <such
behaviors are no longer encouraged for general use cases>.
[]
* Writing a Simple Yarn Application
** Writing a simple Client
* The first step that a client needs to do is to initialize and start a
YarnClient.
+---+
YarnClient yarnClient = YarnClient.createYarnClient();
yarnClient.init(conf);
yarnClient.start();
+---+
* Once a client is set up, the client needs to create an application, and get
its application id.
+---+
YarnClientApplication app = yarnClient.createApplication();
GetNewApplicationResponse appResponse = app.getNewApplicationResponse();
+---+
* The response from the <<<YarnClientApplication>>> for a new application also
contains information about the cluster such as the minimum/maximum resource
capabilities of the cluster. This is required to ensure that you can
correctly set the specifications of the container in which the
ApplicationMaster will be launched. Please refer to
<<<GetNewApplicationResponse>>> for more details.
* The main crux of a client is to set up the <<<ApplicationSubmissionContext>>>
which defines all the information needed by the RM to launch the AM. A client
needs to set the following into the context:
* Application info: id, name
* Queue, priority info: Queue to which the application will be submitted,
the priority to be assigned for the application.
* User: The user submitting the application
* <<<ContainerLaunchContext>>>: The information defining the container in
which the AM will be launched and run. The <<<ContainerLaunchContext>>>, as
mentioned previously, defines all the required information needed to run
the application such as the local <<R>>esources (binaries, jars, files
etc.), <<E>>nvironment settings (CLASSPATH etc.), the <<C>>ommand to be
executed and security <<T>>okens (<RECT>).
[]
+---+
// set the application submission context
ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
ApplicationId appId = appContext.getApplicationId();
appContext.setKeepContainersAcrossApplicationAttempts(keepContainers);
appContext.setApplicationName(appName);
// set local resources for the application master
// local files or archives as needed
// In this scenario, the jar file for the application master is part of the local resources
Map<String, LocalResource> localResources = new HashMap<String, LocalResource>();
LOG.info("Copy App Master jar from local filesystem and add to local environment");
// Copy the application master jar to the filesystem
// Create a local resource to point to the destination jar path
FileSystem fs = FileSystem.get(conf);
addToLocalResources(fs, appMasterJar, appMasterJarPath, appId.toString(),
    localResources, null);
// Set the log4j properties if needed
if (!log4jPropFile.isEmpty()) {
  addToLocalResources(fs, log4jPropFile, log4jPath, appId.toString(),
      localResources, null);
}
// The shell script has to be made available on the final container(s)
// where it will be executed.
// To do this, we need to first copy into the filesystem that is visible
// to the yarn framework.
// We do not need to set this as a local resource for the application
// master as the application master does not need it.
String hdfsShellScriptLocation = "";
long hdfsShellScriptLen = 0;
long hdfsShellScriptTimestamp = 0;
if (!shellScriptPath.isEmpty()) {
  Path shellSrc = new Path(shellScriptPath);
  String shellPathSuffix =
      appName + "/" + appId.toString() + "/" + SCRIPT_PATH;
  Path shellDst =
      new Path(fs.getHomeDirectory(), shellPathSuffix);
  fs.copyFromLocalFile(false, true, shellSrc, shellDst);
  hdfsShellScriptLocation = shellDst.toUri().toString();
  FileStatus shellFileStatus = fs.getFileStatus(shellDst);
  hdfsShellScriptLen = shellFileStatus.getLen();
  hdfsShellScriptTimestamp = shellFileStatus.getModificationTime();
}
if (!shellCommand.isEmpty()) {
  addToLocalResources(fs, null, shellCommandPath, appId.toString(),
      localResources, shellCommand);
}
if (shellArgs.length > 0) {
  addToLocalResources(fs, null, shellArgsPath, appId.toString(),
      localResources, StringUtils.join(shellArgs, " "));
}
// Set the env variables to be setup in the env where the application master will be run
LOG.info("Set the environment for the application master");
Map<String, String> env = new HashMap<String, String>();
// put location of shell script into env
// using the env info, the application master will create the correct local resource for the
// eventual containers that will be launched to execute the shell scripts
env.put(DSConstants.DISTRIBUTEDSHELLSCRIPTLOCATION, hdfsShellScriptLocation);
env.put(DSConstants.DISTRIBUTEDSHELLSCRIPTTIMESTAMP, Long.toString(hdfsShellScriptTimestamp));
env.put(DSConstants.DISTRIBUTEDSHELLSCRIPTLEN, Long.toString(hdfsShellScriptLen));
// Add AppMaster.jar location to classpath
// At some point we should not be required to add
// the hadoop specific classpaths to the env.
// It should be provided out of the box.
// For now setting all required classpaths including
// the classpath to "." for the application jar
StringBuilder classPathEnv = new StringBuilder(Environment.CLASSPATH.$$())
    .append(ApplicationConstants.CLASS_PATH_SEPARATOR).append("./*");
for (String c : conf.getStrings(
    YarnConfiguration.YARN_APPLICATION_CLASSPATH,
    YarnConfiguration.DEFAULT_YARN_CROSS_PLATFORM_APPLICATION_CLASSPATH)) {
  classPathEnv.append(ApplicationConstants.CLASS_PATH_SEPARATOR);
  classPathEnv.append(c.trim());
}
classPathEnv.append(ApplicationConstants.CLASS_PATH_SEPARATOR).append(
    "./log4j.properties");
// Set the necessary command to execute the application master
Vector<CharSequence> vargs = new Vector<CharSequence>(30);
// Set java executable command
LOG.info("Setting up app master command");
vargs.add(Environment.JAVA_HOME.$$() + "/bin/java");
// Set Xmx based on am memory size
vargs.add("-Xmx" + amMemory + "m");
// Set class name
vargs.add(appMasterMainClass);
// Set params for Application Master
vargs.add("--container_memory " + String.valueOf(containerMemory));
vargs.add("--container_vcores " + String.valueOf(containerVirtualCores));
vargs.add("--num_containers " + String.valueOf(numContainers));
vargs.add("--priority " + String.valueOf(shellCmdPriority));
for (Map.Entry<String, String> entry : shellEnv.entrySet()) {
  vargs.add("--shell_env " + entry.getKey() + "=" + entry.getValue());
}
if (debugFlag) {
  vargs.add("--debug");
}
vargs.add("1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/AppMaster.stdout");
vargs.add("2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/AppMaster.stderr");
// Get final command
StringBuilder command = new StringBuilder();
for (CharSequence str : vargs) {
  command.append(str).append(" ");
}
LOG.info("Completed setting up app master command " + command.toString());
List<String> commands = new ArrayList<String>();
commands.add(command.toString());
// Set up the container launch context for the application master
ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
    localResources, env, commands, null, null, null);
// Set up resource type requirements
// For now, both memory and vcores are supported, so we set memory and
// vcores requirements
Resource capability = Resource.newInstance(amMemory, amVCores);
appContext.setResource(capability);
// Service data is a binary blob that can be passed to the application
// Not needed in this scenario
// amContainer.setServiceData(serviceData);
// Setup security tokens
if (UserGroupInformation.isSecurityEnabled()) {
  // Note: Credentials class is marked as LimitedPrivate for HDFS and MapReduce
  Credentials credentials = new Credentials();
  String tokenRenewer = conf.get(YarnConfiguration.RM_PRINCIPAL);
  if (tokenRenewer == null || tokenRenewer.length() == 0) {
    throw new IOException(
        "Can't get Master Kerberos principal for the RM to use as renewer");
  }
  // For now, only getting tokens for the default file-system.
  final Token<?> tokens[] =
      fs.addDelegationTokens(tokenRenewer, credentials);
  if (tokens != null) {
    for (Token<?> token : tokens) {
      LOG.info("Got dt for " + fs.getUri() + "; " + token);
    }
  }
  DataOutputBuffer dob = new DataOutputBuffer();
  credentials.writeTokenStorageToStream(dob);
  ByteBuffer fsTokens = ByteBuffer.wrap(dob.getData(), 0, dob.getLength());
  amContainer.setTokens(fsTokens);
}
appContext.setAMContainerSpec(amContainer);
+---+
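The command assembly in the listing above can be isolated into a small helper.
The following is a minimal, self-contained sketch rather than code from the
distributed shell client: the class <<<AmCommandSketch>>> and its parameters
are hypothetical, and the literal <<<"<LOG_DIR>">>> is assumed to match the
value of <<<ApplicationConstants.LOG_DIR_EXPANSION_VAR>>>, the placeholder
that is replaced with the container's log directory at launch time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; it does not use any YARN classes.
final class AmCommandSketch {
    // Assumption: mirrors ApplicationConstants.LOG_DIR_EXPANSION_VAR.
    static final String LOG_DIR = "<LOG_DIR>";

    // Joins the java binary, heap setting, main class, AM arguments and
    // stdout/stderr redirects into one shell command line.
    static String buildCommand(String javaHome, int amMemoryMb,
                               String mainClass, List<String> amArgs) {
        List<String> parts = new ArrayList<>();
        parts.add(javaHome + "/bin/java");
        parts.add("-Xmx" + amMemoryMb + "m");
        parts.add(mainClass);
        parts.addAll(amArgs);
        parts.add("1>" + LOG_DIR + "/AppMaster.stdout");
        parts.add("2>" + LOG_DIR + "/AppMaster.stderr");
        return String.join(" ", parts);
    }
}
```

Unlike the loop in the listing, <<<String.join>>> leaves no trailing space;
either form should be acceptable to a shell.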
* After the setup process is complete, the client is ready to submit
the application with specified priority and queue.
+---+
// Set the priority for the application master
Priority pri = Priority.newInstance(amPriority);
appContext.setPriority(pri);
// Set the queue to which this application is to be submitted in the RM
appContext.setQueue(amQueue);
// Submit the application to the applications manager
// SubmitApplicationResponse submitResp = applicationsManager.submitApplication(appRequest);
yarnClient.submitApplication(appContext);
+---+
* At this point, the RM will have accepted the application and in the
background, will go through the process of allocating a container with the
required specifications and then eventually setting up and launching the AM
on the allocated container.
* There are multiple ways a client can track progress of the actual task.
* It can communicate with the RM and request for a report of the application
via the <<<getApplicationReport()>>> method of <<<YarnClient>>>.
+-----+
// Get application report for the appId we are interested in
ApplicationReport report = yarnClient.getApplicationReport(appId);
+-----+
The <<<ApplicationReport>>> received from the RM consists of the following:
* General application information: Application id, queue to which the
application was submitted, user who submitted the application and the
start time for the application.
* ApplicationMaster details: the host on which the AM is running, the
rpc port (if any) on which it is listening for requests from clients
and a token that the client needs to communicate with the AM.
* Application tracking information: If the application supports some form
of progress tracking, it can set a tracking url which is available via
<<<ApplicationReport>>>'s <<<getTrackingUrl()>>> method that a client
can look at to monitor progress.
* Application status: The state of the application as seen by the
ResourceManager is available via
<<<ApplicationReport#getYarnApplicationState>>>. If the
<<<YarnApplicationState>>> is set to <<<FINISHED>>>, the client should
refer to <<<ApplicationReport#getFinalApplicationStatus>>> to check for
the actual success/failure of the application task itself. In case of
failures, <<<ApplicationReport#getDiagnostics>>> may be useful to shed
some more light on the failure.
* If the ApplicationMaster supports it, a client can directly query the AM
itself for progress updates via the host:rpcport information obtained from
the application report. It can also use the tracking url obtained from the
report if available.
* In certain situations, if the application is taking too long or due to other
factors, the client may wish to kill the application. <<<YarnClient>>>
supports the <<<killApplication>>> call that allows a client to send a kill
signal to the AM via the ResourceManager. An ApplicationMaster if so
designed may also support an abort call via its rpc layer that a client may
be able to leverage.
+---+
yarnClient.killApplication(appId);
+---+
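The tracking and kill steps above are often combined into a polling loop. The
sketch below shows one possible shape of that loop without depending on YARN:
the enums <<<AppState>>> and <<<FinalStatus>>> are simplified stand-ins for
<<<YarnApplicationState>>> and <<<FinalApplicationStatus>>>, and the
<<<Supplier>>> stands in for repeated calls to
<<<yarnClient.getApplicationReport(appId)>>>. All names here are
hypothetical.

```java
import java.util.function.Supplier;

// Hypothetical, YARN-free sketch of a client-side monitoring loop.
final class MonitorSketch {
    enum AppState { ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }
    enum FinalStatus { UNDEFINED, SUCCEEDED, FAILED, KILLED }
    enum Outcome { SUCCEEDED, FAILED, TIMED_OUT }

    static final class Report {
        final AppState state;
        final FinalStatus finalStatus;
        Report(AppState state, FinalStatus finalStatus) {
            this.state = state;
            this.finalStatus = finalStatus;
        }
    }

    // Polls up to maxPolls times. FINISHED only means the application ended;
    // the final status decides whether the task itself succeeded.
    static Outcome monitor(Supplier<Report> reports, int maxPolls, long pollIntervalMs) {
        for (int i = 0; i < maxPolls; i++) {
            Report report = reports.get();
            if (report.state == AppState.FINISHED) {
                return report.finalStatus == FinalStatus.SUCCEEDED
                    ? Outcome.SUCCEEDED : Outcome.FAILED;
            }
            if (report.state == AppState.FAILED || report.state == AppState.KILLED) {
                return Outcome.FAILED;
            }
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Outcome.TIMED_OUT;
            }
        }
        // The caller would typically invoke killApplication(appId) here.
        return Outcome.TIMED_OUT;
    }
}
```

A client that receives <<<TIMED_OUT>>> could then decide to call
<<<yarnClient.killApplication(appId)>>> as shown above.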
** Writing an ApplicationMaster (AM)
* The AM is the actual owner of the job. It is launched by the RM and, via
the client, is provided with all the necessary information and resources
for the job that it has been tasked to oversee and complete.
* As the AM is launched within a container that may (likely
will) be sharing a physical host with other containers, given the
multi-tenancy nature, amongst other issues, it cannot make any assumptions
of things like pre-configured ports that it can listen on.
* When the AM starts up, several parameters are made available
to it via the environment. These include the <<<ContainerId>>> for the
AM container, the application submission time and details
about the NM (NodeManager) host running the ApplicationMaster.
Ref <<<ApplicationConstants>>> for parameter names.
* All interactions with the RM require an <<<ApplicationAttemptId>>> (there can
be multiple attempts per application in case of failures). The
<<<ApplicationAttemptId>>> can be obtained from the AM's container id. There
are helper APIs to convert the value obtained from the environment into
objects.
+---+
Map<String, String> envs = System.getenv();
String containerIdString =
    envs.get(ApplicationConstants.AM_CONTAINER_ID_ENV);
if (containerIdString == null) {
  // container id should always be set in the env by the framework
  throw new IllegalArgumentException(
      "ContainerId not set in the environment");
}
ContainerId containerId = ConverterUtils.toContainerId(containerIdString);
ApplicationAttemptId appAttemptID = containerId.getApplicationAttemptId();
+---+
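To make the relationship between the two ids concrete, the sketch below
derives an attempt id string from a container id string by hand. It is an
illustration only: it assumes the textual layout
<<<container_[e<epoch>_]<clusterTimestamp>_<appId>_<attemptId>_<containerId>>>>
and the <<<appattempt_...>>> layout shown in the code, and the class name is
hypothetical. Real applications should rely on the helper APIs shown above
instead of parsing these strings themselves.

```java
// Hypothetical parser for illustration; assumes the id layouts described
// in the lead-in and does not use any YARN classes.
final class ContainerIdSketch {
    static String attemptIdFromContainerId(String containerIdString) {
        String[] parts = containerIdString.split("_");
        boolean validLength = (parts.length == 5 || parts.length == 6);
        if (!validLength || !parts[0].equals("container")) {
            throw new IllegalArgumentException(
                "Unexpected container id: " + containerIdString);
        }
        // With six fields, the second one is assumed to be an epoch such as "e17".
        int offset = (parts.length == 6) ? 2 : 1;
        long clusterTimestamp = Long.parseLong(parts[offset]);
        int appId = Integer.parseInt(parts[offset + 1]);
        int attemptId = Integer.parseInt(parts[offset + 2]);
        return String.format("appattempt_%d_%04d_%06d",
            clusterTimestamp, appId, attemptId);
    }
}
```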
* After an AM has initialized itself completely, we can start the two clients:
one to ResourceManager, and one to NodeManagers. We set them up with our
customized event handler, and we will talk about those event handlers in
detail later in this article.
+---+
AMRMClientAsync.CallbackHandler allocListener = new RMCallbackHandler();
amRMClient = AMRMClientAsync.createAMRMClientAsync(1000, allocListener);
amRMClient.init(conf);
amRMClient.start();
containerListener = createNMCallbackHandler();
nmClientAsync = new NMClientAsyncImpl(containerListener);
nmClientAsync.init(conf);
nmClientAsync.start();
+---+
* The AM has to emit heartbeats to the RM to keep it informed that the AM is
alive and still running. The timeout expiry interval at the RM is defined by
a config setting accessible via
<<<YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS>>> with the default being
defined by <<<YarnConfiguration.DEFAULT_RM_AM_EXPIRY_INTERVAL_MS>>>. The
ApplicationMaster needs to register itself with the ResourceManager to
start heartbeating.
+---+
// Register self with ResourceManager
// This will start heartbeating to the RM
appMasterHostname = NetUtils.getHostname();
RegisterApplicationMasterResponse response = amRMClient
.registerApplicationMaster(appMasterHostname, appMasterRpcPort,
appMasterTrackingUrl);
+---+
* The registration response includes the maximum resource capability of the cluster. You may want to use this to check the application's request.
+---+
// Dump out information about cluster capability as seen by the
// resource manager
int maxMem = response.getMaximumResourceCapability().getMemory();
LOG.info("Max mem capability of resources in this cluster " + maxMem);
int maxVCores = response.getMaximumResourceCapability().getVirtualCores();
LOG.info("Max vcores capability of resources in this cluster " + maxVCores);
// A resource ask cannot exceed the max.
if (containerMemory > maxMem) {
  LOG.info("Container memory specified above max threshold of cluster."
      + " Using max value." + ", specified=" + containerMemory + ", max="
      + maxMem);
  containerMemory = maxMem;
}
if (containerVirtualCores > maxVCores) {
  LOG.info("Container virtual cores specified above max threshold of cluster."
      + " Using max value." + ", specified=" + containerVirtualCores + ", max="
      + maxVCores);
  containerVirtualCores = maxVCores;
}
List<Container> previousAMRunningContainers =
    response.getContainersFromPreviousAttempts();
LOG.info("Received " + previousAMRunningContainers.size()
    + " previous AM's running containers on AM registration.");
+---+
* Based on the task requirements, the AM can ask for a set of containers to run
its tasks on. We can now calculate how many containers we need, and request
that many containers.
+---+
List<Container> previousAMRunningContainers =
    response.getContainersFromPreviousAttempts();
LOG.info("Received " + previousAMRunningContainers.size()
    + " previous AM's running containers on AM registration.");
int numTotalContainersToRequest =
    numTotalContainers - previousAMRunningContainers.size();
// Setup ask for containers from RM
// Send request for containers to RM
// Until we get our fully allocated quota, we keep on polling RM for
// containers
// Keep looping until all the containers are launched and shell script
// executed on them ( regardless of success/failure).
for (int i = 0; i < numTotalContainersToRequest; ++i) {
  ContainerRequest containerAsk = setupContainerAskForRM();
  amRMClient.addContainerRequest(containerAsk);
}
+---+
* In <<<setupContainerAskForRM()>>>, the following two things need to be set up:
* Resource capability: Currently, YARN supports memory-based resource
requirements, so the request should define how much memory is needed. The
value is defined in MB, must not exceed the max capability of the
cluster and should be an exact multiple of the min capability. Memory
resources correspond to physical memory limits imposed on the task
containers. YARN also supports computation-based resources (vCores), as
shown in the code.
* Priority: When asking for sets of containers, an AM may define different
priorities to each set. For example, the Map-Reduce AM may assign a higher
priority to containers needed for the Map tasks and a lower priority for
the Reduce tasks' containers.
[]
+---+
private ContainerRequest setupContainerAskForRM() {
  // setup requirements for hosts
  // using * as any host will do for the distributed shell app
  // set the priority for the request
  Priority pri = Priority.newInstance(requestPriority);
  // Set up resource type requirements
  // For now, memory and CPU are supported so we set memory and cpu requirements
  Resource capability = Resource.newInstance(containerMemory,
      containerVirtualCores);
  ContainerRequest request = new ContainerRequest(capability, null, null,
      pri);
  LOG.info("Requested container ask: " + request.toString());
  return request;
}
+---+
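The constraint on the memory value (a multiple of the minimum capability that
does not exceed the maximum) can be expressed as a small calculation. The
following sketch is an assumption-labeled illustration: the class and method
names are hypothetical, and the scheduler may apply its own normalization to a
request regardless of what the AM computes.

```java
// Hypothetical helper; illustrates the arithmetic only, not a YARN API.
final class ResourceAskSketch {
    // Rounds the requested memory up to a multiple of minMb, then caps it at
    // the largest multiple of minMb that does not exceed maxMb.
    static int normalizeMemoryMb(int requestedMb, int minMb, int maxMb) {
        if (minMb <= 0 || maxMb < minMb) {
            throw new IllegalArgumentException("invalid min/max capability");
        }
        int atLeastMin = Math.max(requestedMb, minMb);
        int roundedUp = ((atLeastMin + minMb - 1) / minMb) * minMb;
        int largestAllowed = (maxMb / minMb) * minMb;
        return Math.min(roundedUp, largestAllowed);
    }
}
```

For example, with a minimum of 1024 MB and a maximum of 8192 MB, a request for
1500 MB becomes 2048 MB, and a request for 9000 MB is capped at 8192 MB.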
* After container allocation requests have been sent by the
ApplicationMaster, containers will be launched asynchronously by the event
handler of the <<<AMRMClientAsync>>> client. The handler should implement
the <<<AMRMClientAsync.CallbackHandler>>> interface.
* When there are containers allocated, the handler sets up a thread that runs
the code to launch containers. Here we use the name
<<<LaunchContainerRunnable>>> to demonstrate. We will talk about the
<<<LaunchContainerRunnable>>> class in the following part of this article.
+---+
@Override
public void onContainersAllocated(List<Container> allocatedContainers) {
  LOG.info("Got response from RM for container ask, allocatedCnt="
      + allocatedContainers.size());
  numAllocatedContainers.addAndGet(allocatedContainers.size());
  for (Container allocatedContainer : allocatedContainers) {
    LaunchContainerRunnable runnableLaunchContainer =
        new LaunchContainerRunnable(allocatedContainer, containerListener);
    Thread launchThread = new Thread(runnableLaunchContainer);
    // launch and start the container on a separate thread to keep
    // the main thread unblocked
    // as all containers may not be allocated at one go.
    launchThreads.add(launchThread);
    launchThread.start();
  }
}
+---+
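The thread-per-container pattern used by the handler can be tried out in
isolation. The sketch below is a YARN-free approximation under stated
assumptions: containers are represented by plain id strings, the
<<<launch>>> method is a placeholder for the work that
<<<LaunchContainerRunnable>>> would do, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the allocation callback; no YARN classes used.
final class LaunchThreadsSketch {
    private final List<Thread> launchThreads = new ArrayList<>();
    private final AtomicInteger numAllocated = new AtomicInteger();
    private final AtomicInteger numLaunched = new AtomicInteger();

    // Starts one thread per allocated container so the callback returns quickly.
    void onContainersAllocated(List<String> allocatedContainerIds) {
        numAllocated.addAndGet(allocatedContainerIds.size());
        for (String id : allocatedContainerIds) {
            Thread launchThread = new Thread(() -> launch(id), "launch-" + id);
            synchronized (launchThreads) {
                launchThreads.add(launchThread);
            }
            launchThread.start();
        }
    }

    // Placeholder for building a launch context and starting the container.
    private void launch(String containerId) {
        numLaunched.incrementAndGet();
    }

    // Waits for the launch threads started so far, up to timeoutMs each.
    void awaitLaunches(long timeoutMs) {
        List<Thread> snapshot;
        synchronized (launchThreads) {
            snapshot = new ArrayList<>(launchThreads);
        }
        for (Thread t : snapshot) {
            try {
                t.join(timeoutMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    int allocated() { return numAllocated.get(); }
    int launched() { return numLaunched.get(); }
}
```

Keeping the threads in a list, as the handler above does, lets the AM join
them before it shuts down.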
* On each heartbeat, the event handler reports the progress of the application.
+---+
@Override
public float getProgress() {
  // set progress to deliver to RM on next heartbeat
  float progress = (float) numCompletedContainers.get()
      / numTotalContainers;
  return progress;
}
+---+
[]
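The progress value is simply the fraction of completed containers. A
self-contained sketch of that bookkeeping follows; the class name is
hypothetical, and the guards against a zero total and against values above 1.0
are additions made for this example rather than behavior taken from the
listing above.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical progress tracker; mirrors the counters used in the callback.
final class ProgressSketch {
    private final int numTotalContainers;
    private final AtomicInteger numCompletedContainers = new AtomicInteger();

    ProgressSketch(int numTotalContainers) {
        this.numTotalContainers = numTotalContainers;
    }

    // Called once for every container that finishes, regardless of outcome.
    void containerCompleted() {
        numCompletedContainers.incrementAndGet();
    }

    // Returns a value in [0, 1] suitable for reporting on a heartbeat.
    float getProgress() {
        if (numTotalContainers <= 0) {
            return 0f; // added guard: avoid dividing by zero
        }
        float progress = (float) numCompletedContainers.get() / numTotalContainers;
        return Math.min(1.0f, progress);
    }
}
```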
* The container launch thread actually launches the containers on NMs. After a
container has been allocated to the AM, the AM needs to follow a process
similar to the one the client followed to set up the
<<<ContainerLaunchContext>>> for the eventual task that is going to run on
the allocated container. Once the <<<ContainerLaunchContext>>> is defined,
the AM can start the container through the <<<NMClientAsync>>>.
+---+
// Set the necessary command to execute on the allocated container
Vector<CharSequence> vargs = new Vector<CharSequence>(5);
// Set executable command
vargs.add(shellCommand);
// Set shell script path
if (!scriptPath.isEmpty()) {
  vargs.add(Shell.WINDOWS ? ExecBatScripStringtPath
      : ExecShellStringPath);
}
// Set args for the shell command if any
vargs.add(shellArgs);
// Add log redirect params
vargs.add("1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout");
vargs.add("2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr");
// Get final command
StringBuilder command = new StringBuilder();
for (CharSequence str : vargs) {
  command.append(str).append(" ");
}
List<String> commands = new ArrayList<String>();
commands.add(command.toString());
// Set up ContainerLaunchContext, setting local resource, environment,
// command and token for constructor.
// Note for tokens: Set up tokens for the container too. Today, for normal
// shell commands, the container in distribute-shell doesn't need any
// tokens. We are populating them mainly for NodeManagers to be able to
// download any files in the distributed file-system. The tokens are
// otherwise also useful in cases, for e.g., when one is running a
// "hadoop dfs" command inside the distributed shell.
ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
    localResources, shellEnv, commands, null, allTokens.duplicate(), null);
containerListener.addContainer(container.getId(), container);
nmClientAsync.startContainerAsync(container, ctx);
+---+
* The <<<NMClientAsync>>> object, together with its event handler, handles container events, including container start, stop, status update, and errors.
* After the ApplicationMaster determines that the work is done, it needs to unregister itself through the AM-RM client and then stop the client.
+---+
try {
  amRMClient.unregisterApplicationMaster(appStatus, appMessage, null);
} catch (YarnException ex) {
  LOG.error("Failed to unregister application", ex);
} catch (IOException e) {
  LOG.error("Failed to unregister application", e);
}
amRMClient.stop();
+---+
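The listing above passes an <<<appStatus>>> and an <<<appMessage>>> without
showing where they come from. One plausible way to derive them from container
counters is sketched below. This is an assumption, not the documented
behavior of any particular AM: the <<<Status>>> enum is a simplified stand-in
for <<<FinalApplicationStatus>>>, and the class and method names are
hypothetical.

```java
// Hypothetical derivation of the final status reported at unregistration.
final class FinalStatusSketch {
    enum Status { SUCCEEDED, FAILED }

    // The job counts as successful only if every container completed
    // and none of them failed.
    static Status finalStatus(int total, int completed, int failed) {
        return (failed == 0 && completed == total)
            ? Status.SUCCEEDED : Status.FAILED;
    }

    // Builds a short diagnostic message to send along with a failure;
    // returns null when there is nothing to report.
    static String diagnostics(int total, int completed, int failed) {
        if (finalStatus(total, completed, failed) == Status.SUCCEEDED) {
            return null;
        }
        return "Diagnostics., total=" + total + ", completed=" + completed
            + ", failed=" + failed;
    }
}
```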
~~** Defining the context in which your code runs
~~*** Container Resource Requests
~~*** Local Resources
~~*** Environment
~~**** Managing the CLASSPATH
~~** Security
* FAQ
** How can I distribute my application's jars to all of the nodes in the YARN
cluster that need it?
* You can use the LocalResource to add resources to your application request.
This will cause YARN to distribute the resource to the ApplicationMaster
node. If the resource is a tgz, zip, or jar - you can have YARN unzip it.
Then, all you need to do is add the unzipped folder to your classpath. For
example, when creating your application request:
+---+
File packageFile = new File(packagePath);
URL packageUrl = ConverterUtils.getYarnUrlFromPath(
    FileContext.getFileContext().makeQualified(new Path(packagePath)));
packageResource.setResource(packageUrl);
packageResource.setSize(packageFile.length());
packageResource.setTimestamp(packageFile.lastModified());
packageResource.setType(LocalResourceType.ARCHIVE);
packageResource.setVisibility(LocalResourceVisibility.APPLICATION);
resource.setMemory(memory);
containerCtx.setResource(resource);
containerCtx.setCommands(ImmutableList.of(
    "java -cp './package/*' some.class.to.Run "
    + "1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout "
    + "2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));
containerCtx.setLocalResources(
    Collections.singletonMap("package", packageResource));
appCtx.setApplicationId(appId);
appCtx.setUser(user.getShortUserName());
appCtx.setAMContainerSpec(containerCtx);
yarnClient.submitApplication(appCtx);
+---+
As you can see, the <<<setLocalResources>>> command takes a map of names to
resources. The name becomes a sym link in your application's cwd, so you can
just refer to the artifacts inside by using ./package/*.
Note: Java's classpath (cp) argument is VERY sensitive.
Make sure you get the syntax EXACTLY correct.
Once your package is distributed to your AM, you'll need to follow the same
process whenever your AM starts a new container (assuming you want the
resources to be sent to your container). The code for this is the same. You
just need to make sure that you give your AM the package path (either HDFS, or
local), so that it can send the resource URL along with the container ctx.
** How do I get the ApplicationMaster's <<<ApplicationAttemptId>>>?
* The <<<ApplicationAttemptId>>> will be passed to the AM via the environment
and the value from the environment can be converted into an
<<<ApplicationAttemptId>>> object via the ConverterUtils helper function.
** Why is my container killed by the NodeManager?
* This is likely due to high memory usage exceeding your requested container
memory size. There are a number of reasons that can cause this. First, look
at the process tree that the NodeManager dumps when it kills your container.
The two things you're interested in are physical memory and virtual memory.
If you have exceeded physical memory limits your app is using too much
physical memory. If you're running a Java app, you can use -hprof to look at
what is taking up space in the heap. If you have exceeded virtual memory, you
may need to increase the value of the cluster-wide configuration variable
<<<yarn.nodemanager.vmem-pmem-ratio>>>.
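As a rough worked example of how the two limits relate, the virtual memory
limit can be thought of as the container's physical memory multiplied by the
ratio. The sketch below only illustrates that arithmetic: the class name is
hypothetical, and the ratio of 2.1 used in the usage note is assumed to be the
shipped default, so check <<<yarn-default.xml>>> for your release.

```java
// Hypothetical calculator; illustrates the limit arithmetic only.
final class VmemLimitSketch {
    // Converts the container's physical memory from MB to bytes and
    // scales it by the configured virtual-to-physical memory ratio.
    static long vmemLimitBytes(long containerMemoryMb, double vmemPmemRatio) {
        long pmemLimitBytes = containerMemoryMb * 1024L * 1024L;
        return (long) (pmemLimitBytes * vmemPmemRatio);
    }
}
```

Under these assumptions, a 1024 MB container with a ratio of 2.1 would be
allowed roughly 2150 MB of virtual memory before the NodeManager considers it
over the limit.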
** How do I include native libraries?
* Setting <<<-Djava.library.path>>> on the command line while launching a
container can cause native libraries used by Hadoop to not be loaded
correctly and can result in errors. It is cleaner to use
<<<LD_LIBRARY_PATH>>> instead.
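One way to follow this advice is to add the variable to the environment map
that is passed to the <<<ContainerLaunchContext>>>. The helper below is a
minimal sketch under stated assumptions: the class and method names are
hypothetical, and the <<<:>>> separator assumes a Unix-like NodeManager host.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for preparing a container environment map.
final class NativeLibEnvSketch {
    // Returns a copy of env with nativeDir placed first on LD_LIBRARY_PATH,
    // keeping any value that was already present.
    static Map<String, String> withNativeLibs(Map<String, String> env, String nativeDir) {
        Map<String, String> result = new HashMap<>(env);
        String existing = result.get("LD_LIBRARY_PATH");
        String value = (existing == null || existing.isEmpty())
            ? nativeDir
            : nativeDir + ":" + existing; // assumption: Unix path separator
        result.put("LD_LIBRARY_PATH", value);
        return result;
    }
}
```

The returned map could then be used as the environment argument when the
launch context for the container is created.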
* Useful Links
* {{{http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html}YARN Architecture}}
* {{{http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html}YARN Capacity Scheduler}}
* {{{http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html}YARN Fair Scheduler}}
* Sample code
* Yarn distributed shell: in <<<hadoop-yarn-applications-distributedshell>>>
project after you set up your development environment.


@ -1,77 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
YARN
---
---
${maven.build.timestamp}
Apache Hadoop NextGen MapReduce (YARN)
MapReduce has undergone a complete overhaul in hadoop-0.23 and we now have,
what we call, MapReduce 2.0 (MRv2) or YARN.
The fundamental idea of MRv2 is to split up the two major functionalities of
the JobTracker, resource management and job scheduling/monitoring, into
separate daemons. The idea is to have a global ResourceManager (<RM>) and
per-application ApplicationMaster (<AM>). An application is either a single
job in the classical sense of Map-Reduce jobs or a DAG of jobs.
The ResourceManager and per-node slave, the NodeManager (<NM>), form the
data-computation framework. The ResourceManager is the ultimate authority that
arbitrates resources among all the applications in the system.
The per-application ApplicationMaster is, in effect, a framework specific
library and is tasked with negotiating resources from the ResourceManager and
working with the NodeManager(s) to execute and monitor the tasks.
[./yarn_architecture.gif] MapReduce NextGen Architecture
The ResourceManager has two main components: Scheduler and
ApplicationsManager.
The Scheduler is responsible for allocating resources to the various running
applications subject to familiar constraints of capacities, queues etc. The
Scheduler is a pure scheduler in the sense that it performs no monitoring or
tracking of status for the application. Also, it offers no guarantees about
restarting failed tasks either due to application failure or hardware
failures. The Scheduler performs its scheduling function based on the resource
requirements of the applications; it does so based on the abstract notion of
a resource <Container> which incorporates elements such as memory, cpu, disk,
network etc. In the first version, only <<<memory>>> is supported.
The Scheduler has a pluggable policy plug-in, which is responsible for
partitioning the cluster resources among the various queues, applications etc.
The current Map-Reduce schedulers such as the CapacityScheduler and the
FairScheduler would be some examples of the plug-in.
The CapacityScheduler supports <<<hierarchical queues>>> to allow for more
predictable sharing of cluster resources.
The ApplicationsManager is responsible for accepting job-submissions,
negotiating the first container for executing the application specific
ApplicationMaster and provides the service for restarting the
ApplicationMaster container on failure.
The NodeManager is the per-machine framework agent who is responsible for
containers, monitoring their resource usage (cpu, memory, disk, network) and
reporting the same to the ResourceManager/Scheduler.
The per-application ApplicationMaster has the responsibility of negotiating
appropriate resource containers from the Scheduler, tracking their status and
monitoring for progress.
MRv2 maintains <<API compatibility>> with the previous stable release
(hadoop-1.x). This means that all Map-Reduce jobs should still run
unchanged on top of MRv2 with just a recompile.

View File

@ -1,369 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
YARN Commands
---
---
${maven.build.timestamp}
YARN Commands
%{toc|section=1|fromDepth=0}
* Overview
YARN commands are invoked by the bin/yarn script. Running the yarn script
without any arguments prints the description for all commands.
Usage: <<<yarn [SHELL_OPTIONS] COMMAND [GENERIC_OPTIONS] [COMMAND_OPTIONS]>>>
YARN has an option parsing framework that employs parsing generic options as
well as running classes.
*---------------+--------------+
|| COMMAND_OPTIONS || Description |
*-------------------------+-------------+
| SHELL_OPTIONS | The common set of shell options. These are documented on the {{{../../hadoop-project-dist/hadoop-common/CommandsManual.html#Shell Options}Commands Manual}} page.
*-------------------------+----+
| GENERIC_OPTIONS | The common set of options supported by multiple commands. See the Hadoop {{{../../hadoop-project-dist/hadoop-common/CommandsManual.html#Generic Options}Commands Manual}} for more information.
*------------------+---------------+
| COMMAND COMMAND_OPTIONS | Various commands with their options are described
| | in the following sections. The commands have been
| | grouped into {{User Commands}} and
| | {{Administration Commands}}.
*-------------------------+--------------+
* {User Commands}
Commands useful for users of a Hadoop cluster.
** <<<application>>>
Usage: <<<yarn application [options] >>>
*---------------+--------------+
|| COMMAND_OPTIONS || Description |
*---------------+--------------+
| -appStates States | Works with -list to filter applications based on input
| | comma-separated list of application states. The valid
| | application state can be one of the following: \
| | ALL, NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING,
| | FINISHED, FAILED, KILLED
*---------------+--------------+
| -appTypes Types | Works with -list to filter applications based on input
| | comma-separated list of application types.
*---------------+--------------+
| -list | Lists applications from the RM. Supports optional use of -appTypes
| | to filter applications based on application type, and -appStates to
| | filter applications based on application state.
*---------------+--------------+
| -kill ApplicationId | Kills the application.
*---------------+--------------+
| -status ApplicationId | Prints the status of the application.
*---------------+--------------+
Prints application(s) report/kill application
** <<<applicationattempt>>>
Usage: <<<yarn applicationattempt [options] >>>
*---------------+--------------+
|| COMMAND_OPTIONS || Description |
*---------------+--------------+
| -help | Help
*---------------+--------------+
| -list ApplicationId | Lists application attempts from the RM
*---------------+--------------+
| -status Application Attempt Id | Prints the status of the application attempt.
*---------------+--------------+
prints applicationattempt(s) report
** <<<classpath>>>
Usage: <<<yarn classpath>>>
Prints the class path needed to get the Hadoop jar and the required libraries
** <<<container>>>
Usage: <<<yarn container [options] >>>
*---------------+--------------+
|| COMMAND_OPTIONS || Description |
*---------------+--------------+
| -help | Help
*---------------+--------------+
| -list Application Attempt Id | Lists containers for the application attempt.
*---------------+--------------+
| -status ContainerId | Prints the status of the container.
*---------------+--------------+
prints container(s) report
** <<<jar>>>
Usage: <<<yarn jar <jar> [mainClass] args... >>>
Runs a jar file. Users can bundle their YARN code in a jar file and execute
it using this command.
** <<<logs>>>
Usage: <<<yarn logs -applicationId <application ID> [options] >>>
*---------------+--------------+
|| COMMAND_OPTIONS || Description |
*---------------+--------------+
| -applicationId \<application ID\> | Specifies an application id |
*---------------+--------------+
| -appOwner AppOwner | AppOwner (assumed to be current user if not
| | specified)
*---------------+--------------+
| -containerId ContainerId | ContainerId (must be specified if node address is
| | specified)
*---------------+--------------+
| -help | Help
*---------------+--------------+
| -nodeAddress NodeAddress | NodeAddress in the format nodename:port (must be
| | specified if container id is specified)
*---------------+--------------+
Dump the container logs
** <<<node>>>
Usage: <<<yarn node [options] >>>
*---------------+--------------+
|| COMMAND_OPTIONS || Description |
*---------------+--------------+
| -all | Works with -list to list all nodes.
*---------------+--------------+
| -list | Lists all running nodes. Supports optional use of -states to filter
| | nodes based on node state, and -all to list all nodes.
*---------------+--------------+
| -states States | Works with -list to filter nodes based on input
| | comma-separated list of node states.
*---------------+--------------+
| -status NodeId | Prints the status report of the node.
*---------------+--------------+
Prints node report(s)
** <<<queue>>>
Usage: <<<yarn queue [options] >>>
*---------------+--------------+
|| COMMAND_OPTIONS || Description |
*---------------+--------------+
| -help | Help
*---------------+--------------+
| -status QueueName | Prints the status of the queue.
*---------------+--------------+
Prints queue information
** <<<version>>>
Usage: <<<yarn version>>>
Prints the Hadoop version.
* {Administration Commands}
Commands useful for administrators of a Hadoop cluster.
** <<<daemonlog>>>
Usage:
---------------------------------
yarn daemonlog -getlevel <host:httpport> <classname>
yarn daemonlog -setlevel <host:httpport> <classname> <level>
---------------------------------
*---------------+--------------+
|| COMMAND_OPTIONS || Description |
*---------------+--------------+
| -getlevel \<host:httpport\> \<classname\> | Prints the log level of the log identified
| | by a qualified \<classname\>, in the daemon running at \<host:httpport\>. This
| | command internally connects to http://\<host:httpport\>/logLevel?log=\<classname\>
*---------------+--------------+
| -setlevel \<host:httpport\> \<classname\> \<level\> | Sets the log level of the log
| | identified by a qualified \<classname\> in the daemon running at \<host:httpport\>.
| | This command internally connects to http://\<host:httpport\>/logLevel?log=\<classname\>&level=\<level\>
*---------------+--------------+
Get/Set the log level for a Log identified by a qualified class name in the daemon.
----
Example: $ bin/yarn daemonlog -setlevel 127.0.0.1:8088 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl DEBUG
----
** <<<nodemanager>>>
Usage: <<<yarn nodemanager>>>
Start the NodeManager
** <<<proxyserver>>>
Usage: <<<yarn proxyserver>>>
Start the web proxy server
** <<<resourcemanager>>>
Usage: <<<yarn resourcemanager [-format-state-store]>>>
*---------------+--------------+
|| COMMAND_OPTIONS || Description |
*---------------+--------------+
| -format-state-store | Formats the RMStateStore. This will clear the
| | RMStateStore and is useful if past applications are no
| | longer needed. This should be run only when the
| | ResourceManager is not running.
*---------------+--------------+
Start the ResourceManager
** <<<rmadmin>>>
Usage:
----
yarn rmadmin [-refreshQueues]
[-refreshNodes]
[-refreshUserToGroupsMappings]
[-refreshSuperUserGroupsConfiguration]
[-refreshAdminAcls]
[-refreshServiceAcl]
[-getGroups [username]]
[-transitionToActive [--forceactive] [--forcemanual] <serviceId>]
[-transitionToStandby [--forcemanual] <serviceId>]
[-failover [--forcefence] [--forceactive] <serviceId1> <serviceId2>]
[-getServiceState <serviceId>]
[-checkHealth <serviceId>]
[-help [cmd]]
----
*---------------+--------------+
|| COMMAND_OPTIONS || Description |
*---------------+--------------+
| -refreshQueues | Reload the queues' acls, states and scheduler specific
| | properties. ResourceManager will reload the mapred-queues
| | configuration file.
*---------------+--------------+
| -refreshNodes | Refresh the hosts information at the ResourceManager. |
*---------------+--------------+
| -refreshUserToGroupsMappings| Refresh user-to-groups mappings. |
*---------------+--------------+
| -refreshSuperUserGroupsConfiguration | Refresh superuser proxy groups
| | mappings.
*---------------+--------------+
| -refreshAdminAcls | Refresh acls for administration of ResourceManager |
*---------------+--------------+
| -refreshServiceAcl | Reload the service-level authorization policy file.
| | ResourceManager will reload the authorization policy
| | file.
*---------------+--------------+
| -getGroups [username] | Get groups the specified user belongs to.
*---------------+--------------+
| -transitionToActive [--forceactive] [--forcemanual] \<serviceId\> |
| | Transitions the service into Active state.
| | Try to make the target active
| | without checking that there is no active node
| | if the --forceactive option is used.
| | This command cannot be used if automatic failover is enabled.
| | You can override this with the --forcemanual option,
| | but do so with caution.
*---------------+--------------+
| -transitionToStandby [--forcemanual] \<serviceId\> |
| | Transitions the service into Standby state.
| | This command cannot be used if automatic failover is enabled.
| | You can override this with the --forcemanual option,
| | but do so with caution.
*---------------+--------------+
| -failover [--forcefence] [--forceactive] \<serviceId1\> \<serviceId2\> |
| | Initiate a failover from serviceId1 to serviceId2.
| | Try to failover to the target service even if it is not ready
| | if the --forceactive option is used.
| | This command cannot be used if automatic failover is enabled.
*---------------+--------------+
| -getServiceState \<serviceId\> | Returns the state of the service.
*---------------+--------------+
| -checkHealth \<serviceId\> | Requests that the service perform a health
| | check. The RMAdmin tool will exit with a
| | non-zero exit code if the check fails.
*---------------+--------------+
| -help [cmd] | Displays help for the given command or all commands if none is
| | specified.
*---------------+--------------+
Runs ResourceManager admin client
** scmadmin
Usage: <<<yarn scmadmin [options] >>>
*---------------+--------------+
|| COMMAND_OPTIONS || Description |
*---------------+--------------+
| -help | Help
*---------------+--------------+
| -runCleanerTask | Runs the cleaner task
*---------------+--------------+
Runs Shared Cache Manager admin client
** sharedcachemanager
Usage: <<<yarn sharedcachemanager>>>
Start the Shared Cache Manager
** timelineserver
Usage: <<<yarn timelineserver>>>
Start the TimeLineServer
* Files
** <<etc/hadoop/hadoop-env.sh>>
This file stores the global settings used by all Hadoop shell commands.
** <<etc/hadoop/yarn-env.sh>>
This file stores overrides used by all YARN shell commands.
** <<etc/hadoop/hadoop-user-functions.sh>>
This file allows for advanced users to override some shell functionality.
** <<~/.hadooprc>>
This stores the personal environment for an individual user. It is
processed after the <<<hadoop-env.sh>>>, <<<hadoop-user-functions.sh>>>, and <<<yarn-env.sh>>> files
and can contain the same settings.

View File

@ -1,82 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Apache Hadoop NextGen MapReduce
---
---
${maven.build.timestamp}
MapReduce NextGen aka YARN aka MRv2
The new architecture introduced in hadoop-0.23 divides the two major
functions of the JobTracker, resource management and job life-cycle management,
into separate components.
The new ResourceManager manages the global assignment of compute resources to
applications and the per-application ApplicationMaster manages the
application's scheduling and coordination.
An application is either a single job in the sense of classic MapReduce jobs
or a DAG of such jobs.
The ResourceManager and per-machine NodeManager daemon, which manages the
user processes on that machine, form the computation fabric.
The per-application ApplicationMaster is, in effect, a framework specific
library and is tasked with negotiating resources from the ResourceManager and
working with the NodeManager(s) to execute and monitor the tasks.
More details are available in the {{{./YARN.html}Architecture}} document.
Documentation Index
* YARN
* {{{./YARN.html}YARN Architecture}}
* {{{./CapacityScheduler.html}Capacity Scheduler}}
* {{{./FairScheduler.html}Fair Scheduler}}
* {{{./ResourceManagerRestart.html}ResourceManager Restart}}
* {{{./ResourceManagerHA.html}ResourceManager HA}}
* {{{./WebApplicationProxy.html}Web Application Proxy}}
* {{{./TimelineServer.html}YARN Timeline Server}}
* {{{./WritingYarnApplications.html}Writing YARN Applications}}
* {{{./YarnCommands.html}YARN Commands}}
* {{{hadoop-sls/SchedulerLoadSimulator.html}Scheduler Load Simulator}}
* {{{./NodeManagerRestart.html}NodeManager Restart}}
* {{{./DockerContainerExecutor.html}DockerContainerExecutor}}
* {{{./NodeManagerCGroups.html}Using CGroups}}
* {{{./SecureContainer.html}Secure Containers}}
* {{{./registry/index.html}Registry}}
* YARN REST APIs
* {{{./WebServicesIntro.html}Introduction}}
* {{{./ResourceManagerRest.html}Resource Manager}}
* {{{./NodeManagerRest.html}Node Manager}}

View File

@ -0,0 +1,186 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Hadoop: Capacity Scheduler
==========================
* [Purpose](#Purpose)
* [Overview](#Overview)
* [Features](#Features)
* [Configuration](#Configuration)
* [Setting up `ResourceManager` to use `CapacityScheduler`](#Setting_up_ResourceManager_to_use_CapacityScheduler)
* [Setting up queues](#Setting_up_queues)
* [Queue Properties](#Queue_Properties)
* [Other Properties](#Other_Properties)
* [Reviewing the configuration of the CapacityScheduler](#Reviewing_the_configuration_of_the_CapacityScheduler)
* [Changing Queue Configuration](#Changing_Queue_Configuration)
Purpose
-------
This document describes the `CapacityScheduler`, a pluggable scheduler for Hadoop which allows for multiple-tenants to securely share a large cluster such that their applications are allocated resources in a timely manner under constraints of allocated capacities.
Overview
--------
The `CapacityScheduler` is designed to run Hadoop applications as a shared, multi-tenant cluster in an operator-friendly manner while maximizing the throughput and the utilization of the cluster.
Traditionally each organization has its own private set of compute resources that have sufficient capacity to meet the organization's SLA under peak or near peak conditions. This generally leads to poor average utilization and the overhead of managing multiple independent clusters, one per organization. Sharing clusters between organizations is a cost-effective manner of running large Hadoop installations since this allows them to reap the benefits of economies of scale without creating private clusters. However, organizations are concerned about sharing a cluster because they are worried about others using the resources that are critical for their SLAs.
The `CapacityScheduler` is designed to allow sharing a large cluster while giving each organization capacity guarantees. The central idea is that the available resources in the Hadoop cluster are shared among multiple organizations who collectively fund the cluster based on their computing needs. There is an added benefit that an organization can access any excess capacity not being used by others. This provides elasticity for the organizations in a cost-effective manner.
Sharing clusters across organizations necessitates strong support for multi-tenancy since each organization must be guaranteed capacity and safeguards to ensure the shared cluster is impervious to a single rogue application or user or sets thereof. The `CapacityScheduler` provides a stringent set of limits to ensure that a single application or user or queue cannot consume a disproportionate amount of resources in the cluster. Also, the `CapacityScheduler` provides limits on initialized/pending applications from a single user and queue to ensure fairness and stability of the cluster.
The primary abstraction provided by the `CapacityScheduler` is the concept of *queues*. These queues are typically set up by administrators to reflect the economics of the shared cluster.
To provide further control and predictability on sharing of resources, the `CapacityScheduler` supports *hierarchical queues* to ensure resources are shared among the sub-queues of an organization before other queues are allowed to use free resources, thereby providing *affinity* for sharing free resources among applications of a given organization.
Features
--------
The `CapacityScheduler` supports the following features:
* **Hierarchical Queues** - Hierarchy of queues is supported to ensure resources are shared among the sub-queues of an organization before other queues are allowed to use free resources, thereby providing more control and predictability.
* **Capacity Guarantees** - Queues are allocated a fraction of the capacity of the grid in the sense that a certain capacity of resources will be at their disposal. All applications submitted to a queue will have access to the capacity allocated to the queue. Administrators can configure soft limits and optional hard limits on the capacity allocated to each queue.
* **Security** - Each queue has strict ACLs which control which users can submit applications to individual queues. Also, there are safeguards to ensure that users cannot view and/or modify applications from other users. Also, per-queue and system administrator roles are supported.
* **Elasticity** - Free resources can be allocated to any queue beyond its capacity. When there is demand for these resources from queues running below capacity at a future point in time, as tasks scheduled on these resources complete, they will be assigned to applications on queues running below the capacity (pre-emption is not supported). This ensures that resources are available in a predictable and elastic manner to queues, thus preventing artificial silos of resources in the cluster which helps utilization.
* **Multi-tenancy** - A comprehensive set of limits is provided to prevent a single application, user and queue from monopolizing resources of the queue or the cluster as a whole to ensure that the cluster isn't overwhelmed.
* **Operability**
* Runtime Configuration - The queue definitions and properties such as capacity, ACLs can be changed, at runtime, by administrators in a secure manner to minimize disruption to users. Also, a console is provided for users and administrators to view current allocation of resources to various queues in the system. Administrators can *add additional queues* at runtime, but queues cannot be *deleted* at runtime.
* Drain applications - Administrators can *stop* queues at runtime to ensure that while existing applications run to completion, no new applications can be submitted. If a queue is in `STOPPED` state, new applications cannot be submitted to *itself* or *any of its child queues*. Existing applications continue to completion, thus the queue can be *drained* gracefully. Administrators can also *start* the stopped queues.
* **Resource-based Scheduling** - Support for resource-intensive applications, wherein an application can optionally specify higher resource requirements than the default, thereby accommodating applications with differing resource requirements. Currently, *memory* is the resource requirement supported.
Configuration
-------------
###Setting up `ResourceManager` to use `CapacityScheduler`
To configure the `ResourceManager` to use the `CapacityScheduler`, set the following property in the **conf/yarn-site.xml**:
| Property | Value |
|:---- |:---- |
| `yarn.resourcemanager.scheduler.class` | `org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler` |
###Setting up queues
`etc/hadoop/capacity-scheduler.xml` is the configuration file for the `CapacityScheduler`.
The `CapacityScheduler` has a pre-defined queue called *root*. All queues in the system are children of the root queue.
Further queues can be set up by configuring `yarn.scheduler.capacity.root.queues` with a list of comma-separated child queues.
The configuration for `CapacityScheduler` uses a concept called *queue path* to configure the hierarchy of queues. The *queue path* is the full path of the queue's hierarchy, starting at *root*, with . (dot) as the delimiter.
A given queue's children can be defined with the configuration knob: `yarn.scheduler.capacity.<queue-path>.queues`. Children do not inherit properties directly from the parent unless otherwise noted.
Here is an example with three top-level child-queues `a`, `b` and `c` and some sub-queues for `a` and `b`:
```xml
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>a,b,c</value>
<description>The queues at this level (root is the root queue).
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.a.queues</name>
<value>a1,a2</value>
<description>The queues at this level (root is the root queue).
</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.b.queues</name>
<value>b1,b2,b3</value>
<description>The queues at this level (root is the root queue).
</description>
</property>
```
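As a rough illustration of how the *queue path* notation composes, the sketch below walks a property map shaped like the example above and lists every queue path it implies. The function name `expand_queue_paths` and the plain dictionary are assumptions made for illustration; this is not how the `CapacityScheduler` itself reads its configuration.

```python
def expand_queue_paths(props, parent="root"):
    """Return every queue path reachable from `parent`, depth first.

    `props` is a plain dict standing in for capacity-scheduler.xml
    (an assumption for this sketch, not a Hadoop API).
    """
    paths = [parent]
    # Children of a queue are listed under <queue-path>.queues.
    children = props.get(f"yarn.scheduler.capacity.{parent}.queues", "")
    for child in (c.strip() for c in children.split(",")):
        if child:
            # A child's queue path is the parent's path plus ".<child>".
            paths.extend(expand_queue_paths(props, f"{parent}.{child}"))
    return paths


props = {
    "yarn.scheduler.capacity.root.queues": "a,b,c",
    "yarn.scheduler.capacity.root.a.queues": "a1,a2",
    "yarn.scheduler.capacity.root.b.queues": "b1,b2,b3",
}
queue_paths = expand_queue_paths(props)
```

Each resulting path, such as `root.a.a1`, is the value that would stand in for `<queue-path>` in the per-queue property names.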
###Queue Properties
* Resource Allocation
| Property | Description |
|:---- |:---- |
| `yarn.scheduler.capacity.<queue-path>.capacity` | Queue *capacity* in percentage (%) as a float (e.g. 12.5). The sum of capacities for all queues, at each level, must be equal to 100. Applications in the queue may consume more resources than the queue's capacity if there are free resources, providing elasticity. |
| `yarn.scheduler.capacity.<queue-path>.maximum-capacity` | Maximum queue capacity in percentage (%) as a float. This limits the *elasticity* for applications in the queue. Defaults to -1 which disables it. |
| `yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent` | Each queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is demand for resources. The user limit can vary between a minimum and maximum value. The former (the minimum value) is set to this property value and the latter (the maximum value) depends on the number of users who have submitted applications. For example, suppose the value of this property is 25. If two users have submitted applications to a queue, no single user can use more than 50% of the queue resources. If a third user submits an application, no single user can use more than 33% of the queue resources. With 4 or more users, no user can use more than 25% of the queue's resources. A value of 100 implies no user limits are imposed. The default is 100. Value is specified as an integer. |
| `yarn.scheduler.capacity.<queue-path>.user-limit-factor` | The multiple of the queue capacity which can be configured to allow a single user to acquire more resources. By default this is set to 1 which ensures that a single user can never take more than the queue's configured capacity irrespective of how idle the cluster is. Value is specified as a float. |
| `yarn.scheduler.capacity.<queue-path>.maximum-allocation-mb` | The per queue maximum limit of memory to allocate to each container request at the Resource Manager. This setting overrides the cluster configuration `yarn.scheduler.maximum-allocation-mb`. This value must be smaller than or equal to the cluster maximum. |
| `yarn.scheduler.capacity.<queue-path>.maximum-allocation-vcores` | The per queue maximum limit of virtual cores to allocate to each container request at the Resource Manager. This setting overrides the cluster configuration `yarn.scheduler.maximum-allocation-vcores`. This value must be smaller than or equal to the cluster maximum. |
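The arithmetic in the `minimum-user-limit-percent` description can be sketched as follows. This is a simplified model of the documented example only: the function name `user_limit_percent` is hypothetical, integer division is assumed because the text quotes 33% for three users, and `user-limit-factor` is deliberately ignored.

```python
def user_limit_percent(active_users, minimum_user_limit_percent=100):
    """Approximate per-user share of a queue, in percent.

    Hypothetical helper mirroring the documented example; it is not
    the scheduler's actual computation.
    """
    if active_users < 1:
        raise ValueError("at least one active user is required")
    # An equal split among the users who currently have applications.
    equal_share = 100 // active_users
    # The share never drops below the configured minimum.
    return max(equal_share, minimum_user_limit_percent)


limits = [user_limit_percent(n, 25) for n in (2, 3, 4, 5)]
```

With the property set to 25, two users are each limited to 50%, three users to 33%, and four or more users to 25%, matching the example in the table.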
* Running and Pending Application Limits
The `CapacityScheduler` supports the following parameters to control the running and pending applications:
| Property | Description |
|:---- |:---- |
| `yarn.scheduler.capacity.maximum-applications` / `yarn.scheduler.capacity.<queue-path>.maximum-applications` | Maximum number of applications in the system which can be concurrently active both running and pending. Limits on each queue are directly proportional to their queue capacities and user limits. This is a hard limit and any applications submitted when this limit is reached will be rejected. Default is 10000. This can be set for all queues with `yarn.scheduler.capacity.maximum-applications` and can also be overridden on a per queue basis by setting `yarn.scheduler.capacity.<queue-path>.maximum-applications`. Integer value expected. |
| `yarn.scheduler.capacity.maximum-am-resource-percent` / `yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent` | Maximum percent of resources in the cluster which can be used to run application masters - controls number of concurrent active applications. Limits on each queue are directly proportional to their queue capacities and user limits. Specified as a float - ie 0.5 = 50%. Default is 10%. This can be set for all queues with `yarn.scheduler.capacity.maximum-am-resource-percent` and can also be overridden on a per queue basis by setting `yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent` |
* Queue Administration & Permissions
The `CapacityScheduler` supports the following parameters to administer the queues:
| Property | Description |
|:---- |:---- |
| `yarn.scheduler.capacity.<queue-path>.state` | The *state* of the queue. Can be one of `RUNNING` or `STOPPED`. If a queue is in `STOPPED` state, new applications cannot be submitted to *itself* or *any of its child queues*. Thus, if the *root* queue is `STOPPED` no applications can be submitted to the entire cluster. Existing applications continue to completion, thus the queue can be *drained* gracefully. Value is specified as Enumeration. |
| `yarn.scheduler.capacity.root.<queue-path>.acl_submit_applications` | The *ACL* which controls who can *submit* applications to the given queue. If the given user/group has necessary ACLs on the given queue or *one of the parent queues in the hierarchy* they can submit applications. *ACLs* for this property *are* inherited from the parent queue if not specified. |
| `yarn.scheduler.capacity.root.<queue-path>.acl_administer_queue` | The *ACL* which controls who can *administer* applications on the given queue. If the given user/group has necessary ACLs on the given queue or *one of the parent queues in the hierarchy* they can administer applications. *ACLs* for this property *are* inherited from the parent queue if not specified. |
**Note:** An *ACL* is of the form *user1,user2 group1,group2*: a comma-separated list of users, a single space, then a comma-separated list of groups. The special value of `*` implies *anyone*. The special value of a single space implies *no one*. The default is `*` for the root queue if not specified.
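To make the ACL string format concrete, here is a minimal sketch that interprets a single ACL value as comma-separated users, one space, then comma-separated groups. The helpers `parse_acl` and `can_access` are hypothetical names for illustration; the sketch does not model inheritance from parent queues and is not Hadoop's own ACL implementation.

```python
def parse_acl(acl):
    """Split an ACL string into (allow_all, users, groups)."""
    if acl.strip() == "*":
        # The special value * means anyone.
        return True, set(), set()
    # Users come before the first space, groups after it.
    users_part, _, groups_part = acl.partition(" ")
    users = {u for u in users_part.split(",") if u}
    groups = {g for g in groups_part.strip().split(",") if g}
    return False, users, groups


def can_access(acl, user, user_groups):
    """Return True if `user`, or one of the user's groups, is listed."""
    allow_all, users, groups = parse_acl(acl)
    return allow_all or user in users or bool(groups & set(user_groups))


acl = "user1,user2 group1,group2"
```

Under these assumptions, `can_access(acl, "user2", [])` and `can_access(acl, "eve", ["group2"])` are both true, while a single space as the ACL denies everyone.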
###Other Properties
* Resource Calculator
| Property | Description |
|:---- |:---- |
| `yarn.scheduler.capacity.resource-calculator` | The ResourceCalculator implementation to be used to compare Resources in the scheduler. The default, i.e. org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, only uses Memory while DominantResourceCalculator uses Dominant-resource to compare multi-dimensional resources such as Memory, CPU etc. A Java ResourceCalculator class name is expected. |
* Data Locality
| Property | Description |
|:---- |:---- |
| `yarn.scheduler.capacity.node-locality-delay` | Number of missed scheduling opportunities after which the CapacityScheduler attempts to schedule rack-local containers. Typically, this should be set to the number of nodes in the cluster. The default is 40, which is approximately the number of nodes in one rack. A positive integer value is expected. |
###Reviewing the configuration of the CapacityScheduler
Once the installation and configuration is completed, you can review it after starting the YARN cluster from the web-ui.
* Start the YARN cluster in the normal manner.
* Open the `ResourceManager` web UI.
* The */scheduler* web-page should show the resource usages of individual queues.
Changing Queue Configuration
----------------------------
Changing queue properties and adding new queues is very simple. You need to edit **conf/capacity-scheduler.xml** and run *yarn rmadmin -refreshQueues*.
$ vi $HADOOP_CONF_DIR/capacity-scheduler.xml
$ $HADOOP_YARN_HOME/bin/yarn rmadmin -refreshQueues
**Note:** Queues cannot be *deleted*, only addition of new queues is supported - the updated queue configuration should be a valid one i.e. queue-capacity at each *level* should be equal to 100%.
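Before running `-refreshQueues`, it can help to check the rule that capacities at each level add up to 100. The sketch below assumes the capacities have already been collected into a dictionary keyed by queue path; `check_level_capacities` is a hypothetical helper, not part of any Hadoop tool.

```python
from collections import defaultdict


def check_level_capacities(capacities, tolerance=1e-6):
    """Return {parent_path: total} for levels whose children miss 100%.

    `capacities` maps a queue path (e.g. "root.a.a1") to the value of
    its `capacity` property, in percent.
    """
    totals = defaultdict(float)
    for path, capacity in capacities.items():
        parent, _, _ = path.rpartition(".")
        if parent:  # the root queue itself has no parent level
            totals[parent] += capacity
    return {
        parent: total
        for parent, total in totals.items()
        if abs(total - 100.0) > tolerance
    }


capacities = {
    "root.a": 50.0, "root.b": 37.5, "root.c": 12.5,
    "root.a.a1": 60.0, "root.a.a2": 30.0,
}
bad_levels = check_level_capacities(capacities)
```

In this example the children of `root` sum to 100, but the children of `root.a` sum to 90, so `root.a` is reported as an invalid level.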

View File

@ -0,0 +1,154 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Docker Container Executor
=========================
* [Overview](#Overview)
* [Cluster Configuration](#Cluster_Configuration)
* [Tips for connecting to a secure docker repository](#Tips_for_connecting_to_a_secure_docker_repository)
* [Job Configuration](#Job_Configuration)
* [Docker Image Requirements](#Docker_Image_Requirements)
* [Working example of yarn launched docker containers](#Working_example_of_yarn_launched_docker_containers)
Overview
--------
[Docker](https://www.docker.io/) combines an easy-to-use interface to Linux containers with easy-to-construct image files for those containers. In short, Docker launches very lightweight virtual machines.
The Docker Container Executor (DCE) allows the YARN NodeManager to launch YARN containers into Docker containers. Users can specify the Docker images they want for their YARN containers. These containers provide a custom software environment in which the user's code runs, isolated from the software environment of the NodeManager. These containers can include special libraries needed by the application, and they can have different versions of Perl, Python, and even Java than what is installed on the NodeManager. Indeed, these containers can run a different flavor of Linux than the one running on the NodeManager. However, the YARN container must define all the environments and libraries needed to run the job, because nothing will be shared with the NodeManager.
Docker for YARN provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine).
Cluster Configuration
---------------------
Docker Container Executor runs in non-secure mode of HDFS and YARN. It will not run in secure mode, and will exit if it detects secure mode.
The DockerContainerExecutor requires the Docker daemon to be running on the NodeManagers, and the Docker client to be installed and able to start Docker containers. To prevent timeouts while starting jobs, the Docker images to be used by a job should already be downloaded on the NodeManagers. Here's an example of how this can be done:
sudo docker pull sequenceiq/hadoop-docker:2.4.1
This should be done as part of the NodeManager startup.
The following properties must be set in yarn-site.xml:
```xml
<property>
<name>yarn.nodemanager.docker-container-executor.exec-name</name>
<value>/usr/bin/docker</value>
<description>
Name or path to the Docker client. This is a required parameter. If this is empty,
user must pass an image name as part of the job invocation(see below).
</description>
</property>
<property>
<name>yarn.nodemanager.container-executor.class</name>
<value>org.apache.hadoop.yarn.server.nodemanager.DockerContainerExecutor</value>
<description>
This is the container executor setting that ensures that all
jobs are started with the DockerContainerExecutor.
</description>
</property>
```
Administrators should be aware that DCE doesn't currently provide user namespace isolation. This means, in particular, that software running as root in the YARN container will have root privileges in the underlying NodeManager. Put differently, DCE currently provides no better security guarantees than YARN's Default Container Executor. In fact, DockerContainerExecutor will exit if it detects secure YARN.
Tips for connecting to a secure docker repository
-------------------------------------------------
By default, docker images are pulled from the docker public repository. The format of a docker image url is: *username*/*image\_name*. For example, sequenceiq/hadoop-docker:2.4.1 is an image in docker public repository that contains java and hadoop.
If you use your own private repository, provide the repository url in place of the username. The image url then becomes: *private\_repo\_url*/*image\_name*. For example, if your repository is on localhost:8080, your images would look like: localhost:8080/hadoop-docker.
To connect to a secure docker repository, you can use the following invocation:
```
docker login [OPTIONS] [SERVER]
Register or log in to a Docker registry server, if no server is specified
"https://index.docker.io/v1/" is the default.
-e, --email="" Email
-p, --password="" Password
-u, --username="" Username
```
If you want to log in to a self-hosted registry, you can specify this by adding the server name.
docker login <private_repo_url>
This needs to be run as part of the NodeManager startup, or as a cron job if the login session expires periodically. You can log in to multiple docker repositories from the same NodeManager, but all your users will have access to all your repositories, because at present the DockerContainerExecutor does not support per-job docker login.
Job Configuration
-----------------
Currently you cannot configure any of the Docker settings through the job configuration. You can, however, provide Mapper, Reducer, and ApplicationMaster environment overrides for the docker images, using the following three JVM properties respectively (MR jobs only):
* `mapreduce.map.env`: You can override the mapper's image by passing `yarn.nodemanager.docker-container-executor.image-name`=*your_image_name* to this JVM property.
* `mapreduce.reduce.env`: You can override the reducer's image by passing `yarn.nodemanager.docker-container-executor.image-name`=*your_image_name* to this JVM property.
* `yarn.app.mapreduce.am.env`: You can override the ApplicationMaster's image by passing `yarn.nodemanager.docker-container-executor.image-name`=*your_image_name* to this JVM property.
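For MR jobs, the three overrides listed above could also be collected in a small job configuration file instead of being repeated on the command line. The following is only a sketch: the image name is the public one used elsewhere in this document, and the assumption is that the job is launched through a tool that accepts Hadoop's generic options, so that the file can be supplied with `-conf`.

```xml
<?xml version="1.0"?>
<!-- Sketch of a per-job configuration file; the image name is an example. -->
<configuration>
  <property>
    <name>mapreduce.map.env</name>
    <value>yarn.nodemanager.docker-container-executor.image-name=sequenceiq/hadoop-docker:2.4.1</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>yarn.nodemanager.docker-container-executor.image-name=sequenceiq/hadoop-docker:2.4.1</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>yarn.nodemanager.docker-container-executor.image-name=sequenceiq/hadoop-docker:2.4.1</value>
  </property>
</configuration>
```

Using the same image for all three roles keeps the mapper, reducer, and ApplicationMaster environments consistent; different images can be given per role if a job needs them.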
Docker Image Requirements
-------------------------
The Docker Images used for YARN containers must meet the following requirements:
The distro and version of Linux in your Docker Image can be quite different from that of your NodeManager. (Docker does have a few limitations in this regard, but you're not likely to hit them.) However, if you're using the MapReduce framework, then your image will need to be configured for running Hadoop. Java must be installed in the container, and the following environment variables must be defined in the image: JAVA_HOME, HADOOP_COMMON_PATH, HADOOP_HDFS_HOME, HADOOP_MAPRED_HOME, HADOOP_YARN_HOME, and HADOOP_CONF_DIR.
Working example of yarn launched docker containers
--------------------------------------------------
The following example shows how to run teragen using DockerContainerExecutor.
Step 1. First ensure that YARN is properly configured with DockerContainerExecutor (see above).
```xml
<property>
<name>yarn.nodemanager.docker-container-executor.exec-name</name>
<value>docker -H=tcp://0.0.0.0:4243</value>
<description>
Name or path to the Docker client. The tcp socket must be
where docker daemon is listening.
</description>
</property>
<property>
<name>yarn.nodemanager.container-executor.class</name>
<value>org.apache.hadoop.yarn.server.nodemanager.DockerContainerExecutor</value>
<description>
This is the container executor setting that ensures that all
jobs are started with the DockerContainerExecutor.
</description>
</property>
```
Step 2. Pick a custom Docker image if you want. In this example, we'll use sequenceiq/hadoop-docker:2.4.1 from the docker hub repository. It has jdk, hadoop, and all the previously mentioned environment variables configured.
Step 3. Run.
```bash
hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-${project.version}.jar \
teragen \
-Dmapreduce.map.env="yarn.nodemanager.docker-container-executor.image-name=sequenceiq/hadoop-docker:2.4.1" \
-Dyarn.app.mapreduce.am.env="yarn.nodemanager.docker-container-executor.image-name=sequenceiq/hadoop-docker:2.4.1" \
1000 \
teragen_out_dir
```
Once it succeeds, you can check the yarn debug logs to verify that docker indeed has launched containers.

View File

@ -0,0 +1,233 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Hadoop: Fair Scheduler
======================
* [Purpose](#Purpose)
* [Introduction](#Introduction)
* [Hierarchical queues with pluggable policies](#Hierarchical_queues_with_pluggable_policies)
* [Automatically placing applications in queues](#Automatically_placing_applications_in_queues)
* [Installation](#Installation)
* [Configuration](#Configuration)
* [Properties that can be placed in yarn-site.xml](#Properties_that_can_be_placed_in_yarn-site.xml)
* [Allocation file format](#Allocation_file_format)
* [Queue Access Control Lists](#Queue_Access_Control_Lists)
* [Administration](#Administration)
* [Modifying configuration at runtime](#Modifying_configuration_at_runtime)
* [Monitoring through web UI](#Monitoring_through_web_UI)
* [Moving applications between queues](#Moving_applications_between_queues)
##Purpose
This document describes the `FairScheduler`, a pluggable scheduler for Hadoop that allows YARN applications to share resources in large clusters fairly.
##Introduction
Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time. Hadoop NextGen is capable of scheduling multiple resource types. By default, the Fair Scheduler bases scheduling fairness decisions only on memory. It can be configured to schedule with both memory and CPU, using the notion of Dominant Resource Fairness developed by Ghodsi et al. When there is a single app running, that app uses the entire cluster. When other apps are submitted, resources that free up are assigned to the new apps, so that each app eventually gets roughly the same amount of resources. Unlike the default Hadoop scheduler, which forms a queue of apps, this lets short apps finish in reasonable time while not starving long-lived apps. It is also a reasonable way to share a cluster between a number of users. Finally, fair sharing can also work with app priorities: the priorities are used as weights to determine the fraction of total resources that each app should get.
The scheduler organizes apps further into "queues", and shares resources fairly between these queues. By default, all users share a single queue, named "default". If an app specifically lists a queue in a container resource request, the request is submitted to that queue. It is also possible to assign queues based on the user name included with the request through configuration. Within each queue, a scheduling policy is used to share resources between the running apps. The default is memory-based fair sharing, but FIFO and multi-resource with Dominant Resource Fairness can also be configured. Queues can be arranged in a hierarchy to divide resources and configured with weights to share the cluster in specific proportions.
In addition to providing fair sharing, the Fair Scheduler allows assigning guaranteed minimum shares to queues, which is useful for ensuring that certain users, groups or production applications always get sufficient resources. When a queue contains apps, it gets at least its minimum share, but when the queue does not need its full guaranteed share, the excess is split between other running apps. This lets the scheduler guarantee capacity for queues while utilizing resources efficiently when these queues don't contain applications.
The Fair Scheduler lets all apps run by default, but it is also possible to limit the number of running apps per user and per queue through the config file. This can be useful when a user must submit hundreds of apps at once, or in general to improve performance if running too many apps at once would cause too much intermediate data to be created or too much context-switching. Limiting the apps does not cause any subsequently submitted apps to fail, only to wait in the scheduler's queue until some of the user's earlier apps finish.
##Hierarchical queues with pluggable policies
The fair scheduler supports hierarchical queues. All queues descend from a queue named "root". Available resources are distributed among the children of the root queue in the typical fair scheduling fashion. Then, the children distribute the resources assigned to them to their children in the same fashion. Applications may only be scheduled on leaf queues. Queues can be specified as children of other queues by placing them as sub-elements of their parents in the fair scheduler allocation file.
A queue's name starts with the names of its parents, with periods as separators. So a queue named "queue1" under the root queue would be referred to as "root.queue1", and a queue named "queue2" under a queue named "parent1" would be referred to as "root.parent1.queue2". When referring to queues, the root part of the name is optional, so queue1 could be referred to as just "queue1", and queue2 could be referred to as just "parent1.queue2".
Additionally, the fair scheduler allows setting a different custom policy for each queue to allow sharing the queue's resources in any which way the user wants. A custom policy can be built by extending `org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.SchedulingPolicy`. FifoPolicy, FairSharePolicy (default), and DominantResourceFairnessPolicy are built-in and can be readily used.
Certain add-ons that existed in the original (MR1) Fair Scheduler are not yet supported. Among them is the use of custom policies governing priority "boosting" over certain apps.
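The queue hierarchy and per-queue policies described in this section can be sketched as an allocation file. The queue names follow the naming example above; the weight and the choice of policies are illustrative assumptions.

```xml
<?xml version="1.0"?>
<!-- Sketch of a two-level hierarchy; weight and policies are assumptions. -->
<allocations>
  <!-- Leaf directly under root: "root.queue1" or simply "queue1". -->
  <queue name="queue1">
    <schedulingPolicy>fifo</schedulingPolicy>
  </queue>
  <!-- Parent queue: its share is divided among its children. -->
  <queue name="parent1">
    <weight>2.0</weight>
    <!-- Leaf: "root.parent1.queue2" or simply "parent1.queue2". -->
    <queue name="queue2">
      <schedulingPolicy>drf</schedulingPolicy>
    </queue>
  </queue>
</allocations>
```

Because applications may only be scheduled on leaf queues, an application in this sketch would target `queue1` or `parent1.queue2`, never `parent1` itself.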
##Automatically placing applications in queues
The Fair Scheduler allows administrators to configure policies that automatically place submitted applications into appropriate queues. Placement can depend on the user and groups of the submitter and the requested queue passed by the application. A policy consists of a set of rules that are applied sequentially to classify an incoming application. Each rule either places the app into a queue, rejects it, or continues on to the next rule. Refer to the allocation file format below for how to configure these policies.
##Installation
To use the Fair Scheduler, first assign the appropriate scheduler class in yarn-site.xml:
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
##Configuration
Customizing the Fair Scheduler typically involves altering two files. First, scheduler-wide options can be set by adding configuration properties in the yarn-site.xml file in your existing configuration directory. Second, in most cases users will want to create an allocation file listing which queues exist and their respective weights and capacities. The allocation file is reloaded every 10 seconds, allowing changes to be made on the fly.
###Properties that can be placed in yarn-site.xml
| Property | Description |
|:---- |:---- |
| `yarn.scheduler.fair.allocation.file` | Path to allocation file. An allocation file is an XML manifest describing queues and their properties, in addition to certain policy defaults. This file must be in the XML format described in the next section. If a relative path is given, the file is searched for on the classpath (which typically includes the Hadoop conf directory). Defaults to fair-scheduler.xml. |
| `yarn.scheduler.fair.user-as-default-queue` | Whether to use the username associated with the allocation as the default queue name, in the event that a queue name is not specified. If this is set to "false" or unset, all jobs have a shared default queue, named "default". Defaults to true. If a queue placement policy is given in the allocations file, this property is ignored. |
| `yarn.scheduler.fair.preemption` | Whether to use preemption. Defaults to false. |
| `yarn.scheduler.fair.preemption.cluster-utilization-threshold` | The utilization threshold after which preemption kicks in. The utilization is computed as the maximum ratio of usage to capacity among all resources. Defaults to 0.8f. |
| `yarn.scheduler.fair.sizebasedweight` | Whether to assign shares to individual apps based on their size, rather than providing an equal share to all apps regardless of size. When set to true, apps are weighted by the natural logarithm of one plus the app's total requested memory, divided by the natural logarithm of 2. Defaults to false. |
| `yarn.scheduler.fair.assignmultiple` | Whether to allow multiple container assignments in one heartbeat. Defaults to false. |
| `yarn.scheduler.fair.max.assign` | If assignmultiple is true, the maximum amount of containers that can be assigned in one heartbeat. Defaults to -1, which sets no limit. |
| `yarn.scheduler.fair.locality.threshold.node` | For applications that request containers on particular nodes, the number of scheduling opportunities since the last container assignment to wait before accepting a placement on another node. Expressed as a float between 0 and 1, which, as a fraction of the cluster size, is the number of scheduling opportunities to pass up. The default value of -1.0 means don't pass up any scheduling opportunities. |
| `yarn.scheduler.fair.locality.threshold.rack` | For applications that request containers on particular racks, the number of scheduling opportunities since the last container assignment to wait before accepting a placement on another rack. Expressed as a float between 0 and 1, which, as a fraction of the cluster size, is the number of scheduling opportunities to pass up. The default value of -1.0 means don't pass up any scheduling opportunities. |
| `yarn.scheduler.fair.allow-undeclared-pools` | If this is true, new queues can be created at application submission time, whether because they are specified as the application's queue by the submitter or because they are placed there by the user-as-default-queue property. If this is false, any time an app would be placed in a queue that is not specified in the allocations file, it is placed in the "default" queue instead. Defaults to true. If a queue placement policy is given in the allocations file, this property is ignored. |
| `yarn.scheduler.fair.update-interval-ms` | The interval at which to lock the scheduler and recalculate fair shares, recalculate demand, and check whether anything is due for preemption. Defaults to 500 ms. |
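As an illustration, a few of the properties in the table might be combined in `yarn-site.xml` as shown below. This is a sketch under assumed requirements (preemption enabled, a single shared default queue); the values are examples, and the utilization threshold simply restates its default.

```xml
<!-- Illustrative yarn-site.xml fragment; values are assumptions. -->
<property>
  <name>yarn.scheduler.fair.allocation.file</name>
  <!-- Relative path, so the file is looked up on the classpath. -->
  <value>fair-scheduler.xml</value>
</property>
<property>
  <name>yarn.scheduler.fair.preemption</name>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.fair.preemption.cluster-utilization-threshold</name>
  <value>0.8</value>
</property>
<property>
  <name>yarn.scheduler.fair.user-as-default-queue</name>
  <!-- Apps that name no queue share the "default" queue. -->
  <value>false</value>
</property>
```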
###Allocation file format
The allocation file must be in XML format. The format contains the following types of elements:
* **Queue elements**: which represent queues. Queue elements can take an optional attribute 'type', which when set to 'parent' makes it a parent queue. This is useful when we want to create a parent queue without configuring any leaf queues. Each queue element may contain the following properties:
* minResources: minimum resources the queue is entitled to, in the form "X mb, Y vcores". For the single-resource fairness policy, the vcores value is ignored. If a queue's minimum share is not satisfied, it will be offered available resources before any other queue under the same parent. Under the single-resource fairness policy, a queue is considered unsatisfied if its memory usage is below its minimum memory share. Under dominant resource fairness, a queue is considered unsatisfied if its usage for its dominant resource with respect to the cluster capacity is below its minimum share for that resource. If multiple queues are unsatisfied in this situation, resources go to the queue with the smallest ratio between relevant resource usage and minimum. Note that it is possible that a queue that is below its minimum may not immediately get up to its minimum when it submits an application, because already-running jobs may be using those resources.
* maxResources: maximum resources a queue is allowed, in the form "X mb, Y vcores". For the single-resource fairness policy, the vcores value is ignored. A queue will never be assigned a container that would put its aggregate usage over this limit.
* maxRunningApps: limit the number of apps from the queue to run at once
* maxAMShare: limit the fraction of the queue's fair share that can be used to run application masters. This property can only be used for leaf queues. For example, if set to 1.0f, then AMs in the leaf queue can take up to 100% of both the memory and CPU fair share. The value of -1.0f will disable this feature and the amShare will not be checked. The default value is 0.5f.
* weight: to share the cluster non-proportionally with other queues. Weights default to 1, and a queue with weight 2 should receive approximately twice as many resources as a queue with the default weight.
* schedulingPolicy: to set the scheduling policy of any queue. The allowed values are "fifo"/"fair"/"drf" or any class that extends `org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.SchedulingPolicy`. Defaults to "fair". If "fifo", apps with earlier submit times are given preference for containers, but apps submitted later may run concurrently if there is leftover space on the cluster after satisfying the earlier app's requests.
* aclSubmitApps: a list of users and/or groups that can submit apps to the queue. Refer to the ACLs section below for more info on the format of this list and how queue ACLs work.
* aclAdministerApps: a list of users and/or groups that can administer a queue. Currently the only administrative action is killing an application. Refer to the ACLs section below for more info on the format of this list and how queue ACLs work.
* minSharePreemptionTimeout: number of seconds the queue is under its minimum share before it will try to preempt containers to take resources from other queues. If not set, the queue will inherit the value from its parent queue.
* fairSharePreemptionTimeout: number of seconds the queue is under its fair share threshold before it will try to preempt containers to take resources from other queues. If not set, the queue will inherit the value from its parent queue.
* fairSharePreemptionThreshold: the fair share preemption threshold for the queue. If the queue waits fairSharePreemptionTimeout without receiving fairSharePreemptionThreshold\*fairShare resources, it is allowed to preempt containers to take resources from other queues. If not set, the queue will inherit the value from its parent queue.
* **User elements**: which represent settings governing the behavior of individual users. They can contain a single property: maxRunningApps, a limit on the number of running apps for a particular user.
* **A userMaxAppsDefault element**: which sets the default running app limit for any users whose limit is not otherwise specified.
* **A defaultFairSharePreemptionTimeout element**: which sets the fair share preemption timeout for the root queue; overridden by fairSharePreemptionTimeout element in root queue.
* **A defaultMinSharePreemptionTimeout element**: which sets the min share preemption timeout for the root queue; overridden by minSharePreemptionTimeout element in root queue.
* **A defaultFairSharePreemptionThreshold element**: which sets the fair share preemption threshold for the root queue; overridden by fairSharePreemptionThreshold element in root queue.
* **A queueMaxAppsDefault element**: which sets the default running app limit for queues; overridden by maxRunningApps element in each queue.
* **A queueMaxAMShareDefault element**: which sets the default AM resource limit for queue; overridden by maxAMShare element in each queue.
* **A defaultQueueSchedulingPolicy element**: which sets the default scheduling policy for queues; overridden by the schedulingPolicy element in each queue if specified. Defaults to "fair".
* **A queuePlacementPolicy element**: which contains a list of rule elements that tell the scheduler how to place incoming apps into queues. Rules are applied in the order that they are listed. Rules may take arguments. All rules accept the "create" argument, which indicates whether the rule can create a new queue. "Create" defaults to true; if set to false and the rule would place the app in a queue that is not configured in the allocations file, we continue on to the next rule. The last rule must be one that can never issue a continue. Valid rules are:
* specified: the app is placed into the queue it requested. If the app requested no queue, i.e. it specified "default", we continue. If the app requested a queue name starting or ending with a period, i.e. a name like ".q1" or "q1.", it will be rejected.
* user: the app is placed into a queue with the name of the user who submitted it. Periods in the username will be replaced with "\_dot\_", i.e. the queue name for user "first.last" is "first\_dot\_last".
* primaryGroup: the app is placed into a queue with the name of the primary group of the user who submitted it. Periods in the group name will be replaced with "\_dot\_", i.e. the queue name for group "one.two" is "one\_dot\_two".
* secondaryGroupExistingQueue: the app is placed into a queue with a name that matches a secondary group of the user who submitted it. The first secondary group that matches a configured queue will be selected. Periods in group names will be replaced with "\_dot\_", i.e. a user with "one.two" as one of their secondary groups would be placed into the "one\_dot\_two" queue, if such a queue exists.
* nestedUserQueue: the app is placed into a queue with the name of the user under the queue suggested by the nested rule. This is similar to the 'user' rule; the difference is that with the 'nestedUserQueue' rule, user queues can be created under any parent queue, while the 'user' rule creates user queues only under the root queue. Note that the nestedUserQueue rule is applied only if the nested rule returns a parent queue. A parent queue can be configured either by setting the 'type' attribute of the queue to 'parent' or by configuring at least one leaf under that queue, which makes it a parent. See the example allocation file for a sample use case.
* default: the app is placed into the queue specified in the 'queue' attribute of the default rule. If 'queue' attribute is not specified, the app is placed into 'root.default' queue.
* reject: the app is rejected.
An example allocation file is given here:
```xml
<?xml version="1.0"?>
<allocations>
<queue name="sample_queue">
<minResources>10000 mb,0vcores</minResources>
<maxResources>90000 mb,0vcores</maxResources>
<maxRunningApps>50</maxRunningApps>
<maxAMShare>0.1</maxAMShare>
<weight>2.0</weight>
<schedulingPolicy>fair</schedulingPolicy>
<queue name="sample_sub_queue">
<aclSubmitApps>charlie</aclSubmitApps>
<minResources>5000 mb,0vcores</minResources>
</queue>
</queue>
<queueMaxAMShareDefault>0.5</queueMaxAMShareDefault>
<!-- Queue 'secondary_group_queue' is a parent queue and may have
user queues under it -->
<queue name="secondary_group_queue" type="parent">
<weight>3.0</weight>
</queue>
<user name="sample_user">
<maxRunningApps>30</maxRunningApps>
</user>
<userMaxAppsDefault>5</userMaxAppsDefault>
<queuePlacementPolicy>
<rule name="specified" />
<rule name="primaryGroup" create="false" />
<rule name="nestedUserQueue">
<rule name="secondaryGroupExistingQueue" create="false" />
</rule>
<rule name="default" queue="sample_queue"/>
</queuePlacementPolicy>
</allocations>
```
Note that for backwards compatibility with the original FairScheduler, "queue" elements can instead be named as "pool" elements.
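The example above does not exercise the preemption-related elements. A minimal sketch of how they might be combined follows; the queue name `production` and all numeric values are hypothetical, and the sketch assumes preemption has been switched on through `yarn.scheduler.fair.preemption` in yarn-site.xml.

```xml
<?xml version="1.0"?>
<!-- Sketch only: queue name and all values are hypothetical. -->
<allocations>
  <!-- Defaults for the root queue, inherited by queues that set nothing. -->
  <defaultMinSharePreemptionTimeout>60</defaultMinSharePreemptionTimeout>
  <defaultFairSharePreemptionTimeout>300</defaultFairSharePreemptionTimeout>
  <defaultFairSharePreemptionThreshold>0.5</defaultFairSharePreemptionThreshold>
  <queue name="production">
    <minResources>10000 mb,0vcores</minResources>
    <!-- Seconds below minimum share before preemption is attempted. -->
    <minSharePreemptionTimeout>30</minSharePreemptionTimeout>
    <!-- Seconds below threshold * fair share before preemption is attempted. -->
    <fairSharePreemptionTimeout>120</fairSharePreemptionTimeout>
    <fairSharePreemptionThreshold>0.8</fairSharePreemptionThreshold>
  </queue>
</allocations>
```

With these values, `production` would be allowed to preempt containers after 30 seconds below its minimum share, or after 120 seconds without receiving 0.8 times its fair share; any other queue would fall back to the root-level defaults.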
###Queue Access Control Lists
Queue Access Control Lists (ACLs) allow administrators to control who may take actions on particular queues. They are configured with the aclSubmitApps and aclAdministerApps properties, which can be set per queue. Currently the only supported administrative action is killing an application. Anybody who may administer a queue may also submit applications to it. These properties take values in a format like "user1,user2 group1,group2" or " group1,group2". An action on a queue will be permitted if its user or group is in the ACL of that queue or in the ACL of any of that queue's ancestors. So if queue2 is inside queue1, and user1 is in queue1's ACL, and user2 is in queue2's ACL, then both users may submit to queue2.
**Note:** The delimiter is a space character. To specify only ACL groups, begin the value with a space character.
The root queue's ACLs are "\*" by default which, because ACLs are passed down, means that everybody may submit to and kill applications from every queue. To start restricting access, change the root queue's ACLs to something other than "\*".
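The inheritance rule and the groups-only syntax described above can be sketched in an allocation file. The user and group names (`cluster_admin`, `user1`, `user2`, `ops_group`) are hypothetical, and the queue nesting mirrors the queue1/queue2 example.

```xml
<?xml version="1.0"?>
<!-- Sketch only: user and group names are hypothetical. -->
<allocations>
  <queue name="root">
    <!-- Replaces the default "*" so that access is no longer open to all. -->
    <aclSubmitApps>cluster_admin</aclSubmitApps>
    <aclAdministerApps>cluster_admin</aclAdministerApps>
    <queue name="queue1">
      <aclSubmitApps>user1</aclSubmitApps>
      <queue name="queue2">
        <aclSubmitApps>user2</aclSubmitApps>
        <!-- Leading space: the value lists groups only, no users. -->
        <aclAdministerApps> ops_group</aclAdministerApps>
      </queue>
    </queue>
  </queue>
</allocations>
```

In this sketch, `user2` may submit to queue2 directly, while `user1` and `cluster_admin` may do so through the ACLs of queue2's ancestors. Members of `ops_group` may kill applications in queue2 and, because administrators may also submit, may submit to it as well.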
##Administration
The fair scheduler provides support for administration at runtime through a few mechanisms:
###Modifying configuration at runtime
It is possible to modify minimum shares, limits, weights, preemption timeouts and queue scheduling policies at runtime by editing the allocation file. The scheduler will reload this file 10-15 seconds after it sees that it was modified.
###Monitoring through web UI
Current applications, queues, and fair shares can be examined through the ResourceManager's web interface, at `http://*ResourceManager URL*/cluster/scheduler`.
The following fields can be seen for each queue on the web interface:
* Used Resources - The sum of resources allocated to containers within the queue.
* Num Active Applications - The number of applications in the queue that have received at least one container.
* Num Pending Applications - The number of applications in the queue that have not yet received any containers.
* Min Resources - The configured minimum resources that are guaranteed to the queue.
* Max Resources - The configured maximum resources that are allowed to the queue.
* Instantaneous Fair Share - The queue's instantaneous fair share of resources. These shares consider only active queues (those with running applications), and are used for scheduling decisions. Queues may be allocated resources beyond their shares when other queues aren't using them. A queue whose resource consumption lies at or below its instantaneous fair share will never have its containers preempted.
* Steady Fair Share - The queue's steady fair share of resources. These shares consider all the queues irrespective of whether they are active (have running applications) or not. These are computed less frequently and change only when the configuration or capacity changes. They are meant to provide visibility into resources the user can expect, and hence are displayed in the Web UI.
###Moving applications between queues
The Fair Scheduler supports moving a running application to a different queue. This can be useful for moving an important application to a higher priority queue, or for moving an unimportant application to a lower priority queue. Apps can be moved by running `yarn application -movetoqueue appID -queue targetQueueName`.
When an application is moved to a queue, its existing allocations are counted with the new queue's allocations instead of the old queue's for purposes of determining fairness. An attempt to move an application to a queue will fail if the addition of the app's resources to that queue would violate its maxRunningApps or maxResources constraints.

View File

@ -0,0 +1,57 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
NodeManager Overview
=====================
* [Overview](#Overview)
* [Health Checker Service](#Health_checker_service)
* [Disk Checker](#Disk_Checker)
* [External Health Script](#External_Health_Script)
Overview
--------
The NodeManager is responsible for launching and managing containers on a node. Containers execute tasks as specified by the AppMaster.
Health Checker Service
----------------------
The NodeManager runs services to determine the health of the node it is executing on. The services perform checks on the disk as well as any user-specified tests. If any health check fails, the NodeManager marks the node as unhealthy and communicates this to the ResourceManager, which then stops assigning containers to the node. Communication of the node status is done as part of the heartbeat between the NodeManager and the ResourceManager. The intervals at which the disk checker and health monitor (described below) run don't affect the heartbeat intervals. When the heartbeat takes place, the status of both checks is used to determine the health of the node.
###Disk Checker
The disk checker checks the state of the disks that the NodeManager is configured to use (local-dirs and log-dirs, configured using yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs respectively). The checks include permissions and free disk space. It also checks that the filesystem isn't in a read-only state. The checks are run at 2 minute intervals by default but can be configured to run as often as the user desires. If a disk fails the check, the NodeManager stops using that particular disk but still reports the node status as healthy. However, if a number of disks fail the check (the number can be configured, as explained below), then the node is reported as unhealthy to the ResourceManager and new containers will not be assigned to the node. In addition, once a disk is marked as unhealthy, the NodeManager stops checking it to see if it has recovered (e.g. disk became full and was then cleaned up). The only way for the NodeManager to use that disk again is to restart the software on the node. The following configuration parameters can be used to modify the disk checks:
| Configuration Name | Allowed Values | Description |
|:---- |:---- |:---- |
| `yarn.nodemanager.disk-health-checker.enable` | true, false | Enable or disable the disk health checker service |
| `yarn.nodemanager.disk-health-checker.interval-ms` | Positive integer | The interval, in milliseconds, at which the disk checker should run; the default value is 2 minutes |
| `yarn.nodemanager.disk-health-checker.min-healthy-disks` | Float between 0-1 | The minimum fraction of disks that must pass the check for the NodeManager to mark the node as healthy; the default is 0.25 |
| `yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage` | Float between 0-100 | The maximum percentage of disk space that may be utilized before a disk is marked as unhealthy by the disk checker service. This check is run for every disk used by the NodeManager. The default value is 100 i.e. the entire disk can be used. |
| `yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb` | Integer | The minimum amount of free space that must be available on the disk for the disk checker service to mark the disk as healthy. This check is run for every disk used by the NodeManager. The default value is 0 i.e. the entire disk can be used. |
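
The way these thresholds combine can be sketched in Python. This is a simplified model under stated assumptions, not the NodeManager's implementation: the function names are hypothetical, the permission and read-only checks are omitted, and whether the real comparisons are strict or inclusive is assumed here.

```python
import shutil

MB = 1024 * 1024


def space_ok(total, used, free, max_utilization_pct=100.0, min_free_mb=0):
    """Apply the two per-disk space thresholds to raw byte counts."""
    if total <= 0:
        return False
    used_pct = 100.0 * used / total
    return used_pct <= max_utilization_pct and free / MB >= min_free_mb


def disk_is_healthy(path, **thresholds):
    """Check the filesystem that holds one local-dir or log-dir."""
    usage = shutil.disk_usage(path)
    return space_ok(usage.total, usage.used, usage.free, **thresholds)


def node_is_healthy(disk_results, min_healthy_disks=0.25):
    """disk_results holds one boolean per configured directory."""
    if not disk_results:
        return False
    return sum(disk_results) / len(disk_results) >= min_healthy_disks
```

With the default fraction of 0.25, a node with four configured directories would, in this model, still be reported as healthy when three of them have failed.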
###External Health Script
Users may specify their own health checker script that will be invoked by the health checker service. Users may specify a timeout as well as options to be passed to the script. If the script exits with a non-zero exit code, times out or results in an exception being thrown, the node is marked as unhealthy. If the script cannot be executed due to permissions, an incorrect path, etc., then it counts as a failure and the node will be reported as unhealthy. Specifying a health check script is not mandatory. If no script is specified, only the disk checker status will be used to determine the health of the node. The following configuration parameters can be used to set the health script:
| Configuration Name | Allowed Values | Description |
|:---- |:---- |:---- |
| `yarn.nodemanager.health-checker.interval-ms` | Positive integer | The interval, in milliseconds, at which the health checker service runs; the default value is 10 minutes. |
| `yarn.nodemanager.health-checker.script.timeout-ms` | Positive integer | The timeout for the health script that's executed; the default value is 20 minutes. |
| `yarn.nodemanager.health-checker.script.path` | String | Absolute path to the health check script to be run. |
| `yarn.nodemanager.health-checker.script.opts` | String | Arguments to be passed to the script when the script is executed. |
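
The failure rules described above (non-zero exit code, timeout, or failure to launch) can be sketched as a small Python runner. This only illustrates the stated rules and is not how the NodeManager invokes the script; `run_health_script` is a hypothetical name.

```python
import subprocess


def run_health_script(command, timeout_s):
    """Run the script and return (healthy, reason).

    command is the script path followed by its options, as a list.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    except OSError as exc:
        # Covers a missing file, bad permissions, and similar launch errors.
        return False, f"could not execute: {exc}"
    if result.returncode != 0:
        return False, f"exit code {result.returncode}"
    return True, "ok"
```

The command list corresponds to `script.path` followed by the arguments from `script.opts`, and the timeout to `script.timeout-ms` converted to seconds.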

View File

@ -0,0 +1,57 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Using CGroups with YARN
=======================
* [CGroups Configuration](#CGroups_Configuration)
* [CGroups and Security](#CGroups_and_security)
CGroups is a mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behaviour. CGroups is a Linux kernel feature and was merged into kernel version 2.6.24. From a YARN perspective, this allows containers to be limited in their resource usage. A good example of this is CPU usage. Without CGroups, it becomes hard to limit container CPU usage. Currently, CGroups is only used for limiting CPU usage.
CGroups Configuration
---------------------
This section describes the configuration variables for using CGroups.
The following settings are related to setting up CGroups. These need to be set in *yarn-site.xml*.
|Configuration Name | Description |
|:---- |:---- |
| `yarn.nodemanager.container-executor.class` | This should be set to "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor". CGroups is a Linux kernel feature and is exposed via the LinuxContainerExecutor. |
| `yarn.nodemanager.linux-container-executor.resources-handler.class` | This should be set to "org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler". Using the LinuxContainerExecutor doesn't force you to use CGroups. If you wish to use CGroups, the resources-handler class must be set to CgroupsLCEResourcesHandler. |
| `yarn.nodemanager.linux-container-executor.cgroups.hierarchy` | The cgroups hierarchy under which to place YARN processes (cannot contain commas). If yarn.nodemanager.linux-container-executor.cgroups.mount is false (that is, if cgroups have been pre-configured), then this cgroups hierarchy must already exist. |
| `yarn.nodemanager.linux-container-executor.cgroups.mount` | Whether the LCE should attempt to mount cgroups if not found - can be true or false. |
| `yarn.nodemanager.linux-container-executor.cgroups.mount-path` | Where the LCE should attempt to mount cgroups if not found. Common locations include /sys/fs/cgroup and /cgroup; the default location can vary depending on the Linux distribution in use. This path must exist before the NodeManager is launched. Only used when the LCE resources handler is set to the CgroupsLCEResourcesHandler, and yarn.nodemanager.linux-container-executor.cgroups.mount is true. A point to note here is that the container-executor binary will try to mount the path specified + "/" + the subsystem. In our case, since we are trying to limit CPU the binary tries to mount the path specified + "/cpu" and that's the path it expects to exist. |
| `yarn.nodemanager.linux-container-executor.group` | The Unix group of the NodeManager. It should match the setting in "container-executor.cfg". This configuration is required for validating the secure access of the container-executor binary. |
The following settings are related to limiting resource usage of YARN containers:
|Configuration Name | Description |
|:---- |:---- |
| `yarn.nodemanager.resource.percentage-physical-cpu-limit` | This setting lets you limit the cpu usage of all YARN containers. It sets a hard upper limit on the cumulative CPU usage of the containers. For example, if set to 60, the combined CPU usage of all YARN containers will not exceed 60%. |
| `yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage` | CGroups allows cpu usage limits to be hard or soft. When this setting is true, containers cannot use more CPU usage than allocated even if spare CPU is available. This ensures that containers can only use CPU that they were allocated. When set to false, containers can use spare CPU if available. It should be noted that irrespective of whether set to true or false, at no time can the combined CPU usage of all containers exceed the value specified in "yarn.nodemanager.resource.percentage-physical-cpu-limit". |
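
The interaction between the two settings can be illustrated with a little arithmetic in Python. This is a sketch under stated assumptions: the function and its parameters are hypothetical, a container's allocation is assumed to be proportional to its share of the node's vcores, and the actual cgroup parameters written by the LinuxContainerExecutor are not modelled.

```python
def container_cpu_cap(allocated_vcores, node_vcores, cpu_limit_pct=100, strict=False):
    """Upper bound on one container's CPU, as a percentage of the whole node.

    cpu_limit_pct models yarn.nodemanager.resource.percentage-physical-cpu-limit;
    strict models ...cgroups.strict-resource-usage.
    """
    yarn_share = float(cpu_limit_pct)
    if not strict:
        # Soft limit: spare CPU may be used, up to the combined YARN cap.
        return yarn_share
    # Hard limit: only the container's own (assumed proportional) share.
    return yarn_share * allocated_vcores / node_vcores
```

For example, with the limit set to 60 and a container holding 2 of 8 vcores, this model gives a cap of 15% of the node in strict mode and 60% otherwise; in neither case can all containers together exceed 60%.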
CGroups and security
--------------------
CGroups itself has no requirements related to security. However, the LinuxContainerExecutor does have some requirements. If running in non-secure mode, by default, the LCE runs all jobs as user "nobody". This user can be changed by setting "yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user" to the desired user. However, it can also be configured to run jobs as the user submitting the job. In that case "yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users" should be set to false.
| yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user | yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users | User running jobs |
|:---- |:---- |:---- |
| (default) | (default) | nobody |
| yarn | (default) | yarn |
| yarn | false | (User submitting the job) |
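
The table can also be read as a small decision function. The Python sketch below is illustrative only; the function name is hypothetical, it applies to non-secure mode, and it assumes that setting limit-users to false takes precedence over any configured local user.

```python
def user_running_jobs(submitting_user, local_user="nobody", limit_users=True):
    """Resolve the OS user that runs jobs in non-secure mode.

    local_user models ...nonsecure-mode.local-user,
    limit_users models ...nonsecure-mode.limit-users.
    """
    if not limit_users:
        return submitting_user
    return local_user
```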

View File

@ -0,0 +1,543 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
NodeManager REST APIs
=======================
* [Overview](#Overview)
* [NodeManager Information API](#NodeManager_Information_API)
* [Applications API](#Applications_API)
* [Application API](#Application_API)
* [Containers API](#Containers_API)
* [Container API](#Container_API)
Overview
--------
The NodeManager REST APIs allow the user to get status on the node and information about applications and containers running on that node.
NodeManager Information API
---------------------------
The node information resource provides overall information about that particular node.
### URI
Both of the following URIs give you the node information.
* http://<nm http address:port>/ws/v1/node
* http://<nm http address:port>/ws/v1/node/info
### HTTP Operations Supported
* GET
### Query Parameters Supported
None
### Elements of the *nodeInfo* object
| Properties | Data Type | Description |
|:---- |:---- |:---- |
| id | string | The NodeManager id |
| nodeHostName | string | The host name of the NodeManager |
| totalPmemAllocatedContainersMB | long | The amount of physical memory allocated for use by containers in MB |
| totalVmemAllocatedContainersMB | long | The amount of virtual memory allocated for use by containers in MB |
| totalVCoresAllocatedContainers | long | The number of virtual cores allocated for use by containers |
| lastNodeUpdateTime | long | The last timestamp at which the health report was received (in ms since epoch) |
| healthReport | string | The diagnostic health report of the node |
| nodeHealthy | boolean | true/false indicator of if the node is healthy |
| nodeManagerVersion | string | Version of the NodeManager |
| nodeManagerBuildVersion | string | NodeManager build string with build version, user, and checksum |
| nodeManagerVersionBuiltOn | string | Timestamp when NodeManager was built (in ms since epoch) |
| hadoopVersion | string | Version of hadoop common |
| hadoopBuildVersion | string | Hadoop common build string with build version, user, and checksum |
| hadoopVersionBuiltOn | string | Timestamp when hadoop common was built (in ms since epoch) |
### Response Examples
**JSON response**
HTTP Request:
GET http://<nm http address:port>/ws/v1/node/info
Response Header:
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26)
Response Body:
```json
{
"nodeInfo" : {
"hadoopVersionBuiltOn" : "Mon Jan 9 14:58:42 UTC 2012",
"nodeManagerBuildVersion" : "0.23.1-SNAPSHOT from 1228355 by user1 source checksum 20647f76c36430e888cc7204826a445c",
"lastNodeUpdateTime" : 1326222266126,
"totalVmemAllocatedContainersMB" : 17203,
"totalVCoresAllocatedContainers" : 8,
"nodeHealthy" : true,
"healthReport" : "",
"totalPmemAllocatedContainersMB" : 8192,
"nodeManagerVersionBuiltOn" : "Mon Jan 9 15:01:59 UTC 2012",
"nodeManagerVersion" : "0.23.1-SNAPSHOT",
"id" : "host.domain.com:8041",
"hadoopBuildVersion" : "0.23.1-SNAPSHOT from 1228292 by user1 source checksum 3eba233f2248a089e9b28841a784dd00",
"nodeHostName" : "host.domain.com",
"hadoopVersion" : "0.23.1-SNAPSHOT"
}
}
```
**XML response**
HTTP Request:
GET http://<nm http address:port>/ws/v1/node/info
Accept: application/xml
Response Header:
HTTP/1.1 200 OK
Content-Type: application/xml
Content-Length: 983
Server: Jetty(6.1.26)
Response Body:
```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<nodeInfo>
<healthReport/>
<totalVmemAllocatedContainersMB>17203</totalVmemAllocatedContainersMB>
<totalPmemAllocatedContainersMB>8192</totalPmemAllocatedContainersMB>
<totalVCoresAllocatedContainers>8</totalVCoresAllocatedContainers>
<lastNodeUpdateTime>1326222386134</lastNodeUpdateTime>
<nodeHealthy>true</nodeHealthy>
<nodeManagerVersion>0.23.1-SNAPSHOT</nodeManagerVersion>
<nodeManagerBuildVersion>0.23.1-SNAPSHOT from 1228355 by user1 source checksum 20647f76c36430e888cc7204826a445c</nodeManagerBuildVersion>
<nodeManagerVersionBuiltOn>Mon Jan 9 15:01:59 UTC 2012</nodeManagerVersionBuiltOn>
<hadoopVersion>0.23.1-SNAPSHOT</hadoopVersion>
<hadoopBuildVersion>0.23.1-SNAPSHOT from 1228292 by user1 source checksum 3eba233f2248a089e9b28841a784dd00</hadoopBuildVersion>
<hadoopVersionBuiltOn>Mon Jan 9 14:58:42 UTC 2012</hadoopVersionBuiltOn>
<id>host.domain.com:8041</id>
<nodeHostName>host.domain.com</nodeHostName>
</nodeInfo>
```
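
A client might consume this resource along the following lines. This is a minimal Python sketch using only the standard library; the function names are hypothetical, and the address passed in (for example `host.domain.com:8042`) is an assumption about where the NodeManager web service listens.

```python
import json
from urllib.request import Request, urlopen


def node_info_url(nm_address):
    """Build the node information URI for a NodeManager HTTP address."""
    return f"http://{nm_address}/ws/v1/node/info"


def parse_node_info(body):
    """Extract the nodeInfo object from a JSON response body."""
    return json.loads(body)["nodeInfo"]


def fetch_node_info(nm_address, timeout_s=10):
    """GET the resource, asking for JSON explicitly."""
    request = Request(node_info_url(nm_address), headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout_s) as response:
        return parse_node_info(response.read().decode("utf-8"))


def summarize(info):
    """One-line health summary built from fields in the table above."""
    status = "healthy" if info["nodeHealthy"] else "unhealthy"
    memory_mb = info["totalPmemAllocatedContainersMB"]
    return f"{info['nodeHostName']}: {status}, {memory_mb} MB for containers"
```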
Applications API
----------------
With the Applications API, you can obtain a collection of resources, each of which represents an application. When you run a GET operation on this resource, you obtain a collection of Application Objects. See also [Application API](#Application_API) for syntax of the application object.
### URI
* http://<nm http address:port>/ws/v1/node/apps
### HTTP Operations Supported
* GET
### Query Parameters Supported
Multiple parameters can be specified.
* state - application state
* user - user name
### Elements of the *apps* (Applications) object
When you make a request for the list of applications, the information will be returned as a collection of app objects. See also [Application API](#Application_API) for syntax of the app object.
| Properties | Data Type | Description |
|:---- |:---- |:---- |
| app | array of app objects(JSON)/zero or more app objects(XML) | A collection of application objects |
### Response Examples
**JSON response**
HTTP Request:
GET http://<nm http address:port>/ws/v1/node/apps
Response Header:
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26)
Response Body:
```json
{
"apps" : {
"app" : [
{
"containerids" : [
"container_1326121700862_0003_01_000001",
"container_1326121700862_0003_01_000002"
],
"user" : "user1",
"id" : "application_1326121700862_0003",
"state" : "RUNNING"
},
{
"user" : "user1",
"id" : "application_1326121700862_0002",
"state" : "FINISHED"
}
]
}
}
```
**XML response**
HTTP Request:
GET http://<nm http address:port>/ws/v1/node/apps
Accept: application/xml
Response Header:
HTTP/1.1 200 OK
Content-Type: application/xml
Content-Length: 400
Server: Jetty(6.1.26)
Response Body:
```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<apps>
<app>
<id>application_1326121700862_0002</id>
<state>FINISHED</state>
<user>user1</user>
</app>
<app>
<id>application_1326121700862_0003</id>
<state>RUNNING</state>
<user>user1</user>
<containerids>container_1326121700862_0003_01_000002</containerids>
<containerids>container_1326121700862_0003_01_000001</containerids>
</app>
</apps>
```
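
The two query parameters can be combined when building the request URI. The Python sketch below is illustrative; the helper names are hypothetical. It also shows one way to handle the fact that `containerids` is absent for applications with no running containers, as in the FINISHED application above.

```python
from urllib.parse import urlencode


def apps_url(nm_address, state=None, user=None):
    """Build the applications URI, adding only the filters that were given."""
    params = {}
    if state is not None:
        params["state"] = state
    if user is not None:
        params["user"] = user
    url = f"http://{nm_address}/ws/v1/node/apps"
    return f"{url}?{urlencode(params)}" if params else url


def container_ids(app):
    """Return the app's container ids, or an empty list if none are listed."""
    return app.get("containerids", [])
```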
Application API
---------------
An application resource contains information about a particular application that was run or is running on this NodeManager.
### URI
Use the following URI to obtain an app Object, for an application identified by the appid value.
* http://<nm http address:port>/ws/v1/node/apps/{appid}
### HTTP Operations Supported
* GET
### Query Parameters Supported
None
### Elements of the *app* (Application) object
| Properties | Data Type | Description |
|:---- |:---- |:---- |
| id | string | The application id |
| user | string | The user who started the application |
| state | string | The state of the application - valid states are: NEW, INITING, RUNNING, FINISHING\_CONTAINERS\_WAIT, APPLICATION\_RESOURCES\_CLEANINGUP, FINISHED |
| containerids | array of containerids(JSON)/zero or more containerids(XML) | The list of containerids currently being used by the application on this node. If not present then no containers are currently running for this application. |
### Response Examples
**JSON response**
HTTP Request:
GET http://<nm http address:port>/ws/v1/node/apps/application_1326121700862_0005
Response Header:
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26)
Response Body:
```json
{
"app" : {
"containerids" : [
"container_1326121700862_0005_01_000003",
"container_1326121700862_0005_01_000001"
],
"user" : "user1",
"id" : "application_1326121700862_0005",
"state" : "RUNNING"
}
}
```
**XML response**
HTTP Request:
GET http://<nm http address:port>/ws/v1/node/apps/application_1326121700862_0005
Accept: application/xml
Response Header:
HTTP/1.1 200 OK
Content-Type: application/xml
Content-Length: 281
Server: Jetty(6.1.26)
Response Body:
```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<app>
<id>application_1326121700862_0005</id>
<state>RUNNING</state>
<user>user1</user>
<containerids>container_1326121700862_0005_01_000003</containerids>
<containerids>container_1326121700862_0005_01_000001</containerids>
</app>
```
Containers API
--------------
With the containers API, you can obtain a collection of resources, each of which represents a container. When you run a GET operation on this resource, you obtain a collection of Container Objects. See also [Container API](#Container_API) for syntax of the container object.
### URI
* http://<nm http address:port>/ws/v1/node/containers
### HTTP Operations Supported
* GET
### Query Parameters Supported
None
### Elements of the *containers* object
When you make a request for the list of containers, the information will be returned as a collection of container objects. See also [Container API](#Container_API) for syntax of the container object.
| Properties | Data Type | Description |
|:---- |:---- |:---- |
| containers | array of container objects(JSON)/zero or more container objects(XML) | A collection of container objects |
### Response Examples
**JSON response**
HTTP Request:
GET http://<nm http address:port>/ws/v1/node/containers
Response Header:
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26)
Response Body:
```json
{
"containers" : {
"container" : [
{
"nodeId" : "host.domain.com:8041",
"totalMemoryNeededMB" : 2048,
"totalVCoresNeeded" : 1,
"state" : "RUNNING",
"diagnostics" : "",
"containerLogsLink" : "http://host.domain.com:8042/node/containerlogs/container_1326121700862_0006_01_000001/user1",
"user" : "user1",
"id" : "container_1326121700862_0006_01_000001",
"exitCode" : -1000
},
{
"nodeId" : "host.domain.com:8041",
"totalMemoryNeededMB" : 2048,
"totalVCoresNeeded" : 2,
"state" : "RUNNING",
"diagnostics" : "",
"containerLogsLink" : "http://host.domain.com:8042/node/containerlogs/container_1326121700862_0006_01_000003/user1",
"user" : "user1",
"id" : "container_1326121700862_0006_01_000003",
"exitCode" : -1000
}
]
}
}
```
**XML response**
HTTP Request:
GET http://<nm http address:port>/ws/v1/node/containers
Accept: application/xml
Response Header:
HTTP/1.1 200 OK
Content-Type: application/xml
Content-Length: 988
Server: Jetty(6.1.26)
Response Body:
```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<containers>
<container>
<id>container_1326121700862_0006_01_000001</id>
<state>RUNNING</state>
<exitCode>-1000</exitCode>
<diagnostics/>
<user>user1</user>
<totalMemoryNeededMB>2048</totalMemoryNeededMB>
<totalVCoresNeeded>1</totalVCoresNeeded>
<containerLogsLink>http://host.domain.com:8042/node/containerlogs/container_1326121700862_0006_01_000001/user1</containerLogsLink>
<nodeId>host.domain.com:8041</nodeId>
</container>
<container>
<id>container_1326121700862_0006_01_000003</id>
<state>DONE</state>
<exitCode>0</exitCode>
<diagnostics>Container killed by the ApplicationMaster.</diagnostics>
<user>user1</user>
<totalMemoryNeededMB>2048</totalMemoryNeededMB>
<totalVCoresNeeded>2</totalVCoresNeeded>
<containerLogsLink>http://host.domain.com:8042/node/containerlogs/container_1326121700862_0006_01_000003/user1</containerLogsLink>
<nodeId>host.domain.com:8041</nodeId>
</container>
</containers>
```
Container API
-------------
A container resource contains information about a particular container that is running on this NodeManager.
### URI
Use the following URI to obtain a Container Object, from a container identified by the containerid value.
* http://<nm http address:port>/ws/v1/node/containers/{containerid}
### HTTP Operations Supported
* GET
### Query Parameters Supported
None
### Elements of the *container* object
| Properties | Data Type | Description |
|:---- |:---- |:---- |
| id | string | The container id |
| state | string | State of the container - valid states are: NEW, LOCALIZING, LOCALIZATION\_FAILED, LOCALIZED, RUNNING, EXITED\_WITH\_SUCCESS, EXITED\_WITH\_FAILURE, KILLING, CONTAINER\_CLEANEDUP\_AFTER\_KILL, CONTAINER\_RESOURCES\_CLEANINGUP, DONE |
| nodeId | string | The id of the node the container is on |
| containerLogsLink | string | The http link to the container logs |
| user | string | The user name of the user which started the container |
| exitCode | int | Exit code of the container |
| diagnostics | string | A diagnostic message for failed containers |
| totalMemoryNeededMB | long | Total amount of memory needed by the container (in MB) |
| totalVCoresNeeded | long | Total number of virtual cores needed by the container |
### Response Examples
**JSON response**
HTTP Request:
GET http://<nm http address:port>/ws/v1/node/containers/container_1326121700862_0007_01_000001
Response Header:
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26)
Response Body:
```json
{
"container" : {
"nodeId" : "host.domain.com:8041",
"totalMemoryNeededMB" : 2048,
"totalVCoresNeeded" : 1,
"state" : "RUNNING",
"diagnostics" : "",
"containerLogsLink" : "http://host.domain.com:8042/node/containerlogs/container_1326121700862_0007_01_000001/user1",
"user" : "user1",
"id" : "container_1326121700862_0007_01_000001",
"exitCode" : -1000
}
}
```
**XML response**
HTTP Request:
GET http://<nm http address:port>/ws/v1/node/containers/container_1326121700862_0007_01_000001
Accept: application/xml
Response Header:
HTTP/1.1 200 OK
Content-Type: application/xml
Content-Length: 491
Server: Jetty(6.1.26)
Response Body:
```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<container>
<id>container_1326121700862_0007_01_000001</id>
<state>RUNNING</state>
<exitCode>-1000</exitCode>
<diagnostics/>
<user>user1</user>
<totalMemoryNeededMB>2048</totalMemoryNeededMB>
<totalVCoresNeeded>1</totalVCoresNeeded>
<containerLogsLink>http://host.domain.com:8042/node/containerlogs/container_1326121700862_0007_01_000001/user1</containerLogsLink>
<nodeId>host.domain.com:8041</nodeId>
</container>
```
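
An XML response of this shape can be turned into a dictionary with the Python standard library. This is a sketch only: `parse_container` is a hypothetical helper, and it assumes the three numeric elements are always present, as they are in the example above.

```python
import xml.etree.ElementTree as ET

NUMERIC_FIELDS = ("exitCode", "totalMemoryNeededMB", "totalVCoresNeeded")


def parse_container(xml_text):
    """Map each child element of <container> to a dictionary entry."""
    root = ET.fromstring(xml_text)
    # An empty element such as <diagnostics/> becomes an empty string.
    fields = {child.tag: (child.text or "") for child in root}
    for key in NUMERIC_FIELDS:
        fields[key] = int(fields[key])
    return fields
```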

View File

@ -0,0 +1,53 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
NodeManager Restart
===================
* [Introduction](#Introduction)
* [Enabling NM Restart](#Enabling_NM_Restart)
Introduction
------------
This document gives an overview of NodeManager (NM) restart, a feature that enables the NodeManager to be restarted without losing the active containers running on the node. At a high level, the NM stores any necessary state to a local state-store as it processes container-management requests. When the NM restarts, it recovers by first loading state for various subsystems and then letting those subsystems perform recovery using the loaded state.
Enabling NM Restart
-------------------
Step 1. To enable NM Restart functionality, set the following property in **conf/yarn-site.xml** to *true*.
| Property | Value |
|:---- |:---- |
| `yarn.nodemanager.recovery.enabled` | `true` (the default value is false) |
Step 2. Configure a path to the local file-system directory where the NodeManager can save its run state.
| Property | Description |
|:---- |:---- |
| `yarn.nodemanager.recovery.dir` | The local filesystem directory in which the node manager will store state when recovery is enabled. The default value is set to `$hadoop.tmp.dir/yarn-nm-recovery`. |
Step 3. Configure a valid RPC address for the NodeManager.
| Property | Description |
|:---- |:---- |
| `yarn.nodemanager.address` | Ephemeral ports (port 0, which is the default) cannot be used for the NodeManager's RPC server specified via yarn.nodemanager.address, as this can make the NM use different ports before and after a restart. This will break any previously running clients that were communicating with the NM before restart. Explicitly setting yarn.nodemanager.address to an address with a specific port number (e.g., 0.0.0.0:45454) is a precondition for enabling NM restart. |
Step 4. Auxiliary services.
* NodeManagers in a YARN cluster can be configured to run auxiliary services. For a completely functional NM restart, YARN relies on any auxiliary service configured to also support recovery. This usually includes (1) avoiding usage of ephemeral ports so that previously running clients (in this case, usually containers) are not disrupted after restart and (2) having the auxiliary service itself support recoverability by reloading any previous state when NodeManager restarts and reinitializes the auxiliary service.
* A simple example for the above is the auxiliary service 'ShuffleHandler' for MapReduce (MR). ShuffleHandler respects the above two requirements already, so users/admins don't have to do anything for it to support NM restart: (1) The configuration property **mapreduce.shuffle.port** controls which port the ShuffleHandler on a NodeManager host binds to, and it defaults to a non-ephemeral port. (2) The ShuffleHandler service also already supports recovery of previous state after NM restarts.
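
The preconditions from Steps 1 and 3 lend themselves to a quick sanity check. The Python sketch below is a hypothetical helper, not part of Hadoop; it takes the relevant yarn-site.xml properties as a plain dictionary and assumes `0.0.0.0:0` as the unset address, based on the statement above that port 0 is the default.

```python
def nm_restart_problems(conf):
    """Return a list of reasons why NM restart would not work as configured."""
    problems = []
    enabled = conf.get("yarn.nodemanager.recovery.enabled", "false")
    if enabled.strip().lower() != "true":
        problems.append("yarn.nodemanager.recovery.enabled is not true")
    # Assumed default when the property is not set.
    address = conf.get("yarn.nodemanager.address", "0.0.0.0:0")
    port = address.rsplit(":", 1)[-1]
    if not port.isdigit() or int(port) == 0:
        problems.append("yarn.nodemanager.address does not specify a fixed port")
    return problems
```

An empty list means both checked preconditions hold; the recovery directory and auxiliary services still need to be reviewed separately.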

View File

@ -0,0 +1,140 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
ResourceManager High Availability
=================================
* [Introduction](#Introduction)
* [Architecture](#Architecture)
* [RM Failover](#RM_Failover)
* [Recovering previous active-RM's state](#Recovering_prevous_active-RMs_state)
* [Deployment](#Deployment)
* [Configurations](#Configurations)
* [Admin commands](#Admin_commands)
* [ResourceManager Web UI services](#ResourceManager_Web_UI_services)
* [Web Services](#Web_Services)
Introduction
------------
This guide provides an overview of High Availability of YARN's ResourceManager, and details how to configure and use this feature. The ResourceManager (RM) is responsible for tracking the resources in a cluster, and scheduling applications (e.g., MapReduce jobs). Prior to Hadoop 2.4, the ResourceManager was the single point of failure in a YARN cluster. The High Availability feature adds redundancy in the form of an Active/Standby ResourceManager pair to remove this otherwise single point of failure.
Architecture
------------
![Overview of ResourceManager High Availability](images/rm-ha-overview.png)
### RM Failover
ResourceManager HA is realized through an Active/Standby architecture - at any point of time, one of the RMs is Active, and one or more RMs are in Standby mode waiting to take over should anything happen to the Active. The trigger to transition-to-active comes from either the admin (through CLI) or through the integrated failover-controller when automatic-failover is enabled.
#### Manual transitions and failover
When automatic failover is not enabled, admins have to manually transition one of the RMs to Active. To fail over from one RM to the other, they are expected to first transition the Active-RM to Standby and then transition a Standby-RM to Active. All this can be done using the "`yarn rmadmin`" CLI.
#### Automatic failover
The RMs have an option to embed the ZooKeeper-based ActiveStandbyElector to decide which RM should be the Active. When the Active goes down or becomes unresponsive, another RM is automatically elected to be the Active, which then takes over. Note that there is no need to run a separate ZKFC daemon as is the case for HDFS, because the ActiveStandbyElector embedded in the RMs acts as both failure detector and leader elector.
#### Client, ApplicationMaster and NodeManager on RM failover
When there are multiple RMs, the configuration (yarn-site.xml) used by clients and nodes is expected to list all the RMs. Clients, ApplicationMasters (AMs) and NodeManagers (NMs) try connecting to the RMs in a round-robin fashion until they hit the Active RM. If the Active goes down, they resume the round-robin polling until they hit the "new" Active. This default retry logic is implemented as `org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider`. You can override the logic by implementing `org.apache.hadoop.yarn.client.RMFailoverProxyProvider` and setting the value of `yarn.client.failover-proxy-provider` to the class name.
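
The round-robin behaviour can be sketched in a few lines of Python. This illustrates the idea only and is not the `ConfiguredRMFailoverProxyProvider` implementation: `find_active_rm` and the `is_active` callback are hypothetical, and sleeping between attempts is left out.

```python
def find_active_rm(rm_ids, is_active, start=0, max_attempts=None):
    """Try the configured RMs in order, wrapping around, until one is Active.

    Returns the index of the Active RM, or None if no attempt succeeded.
    """
    if not rm_ids:
        raise ValueError("at least one RM id is required")
    attempts = len(rm_ids) if max_attempts is None else max_attempts
    for offset in range(attempts):
        index = (start + offset) % len(rm_ids)
        if is_active(rm_ids[index]):
            return index
    return None
```

After a failover, a caller would resume from the index it last used, so the search continues with the next RM in the list rather than starting over.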
### Recovering prevous active-RM's state
With [ResourceManager Restart](./ResourceManagerRestart.html) enabled, the RM being promoted to an active state loads the RM internal state and continues to operate from where the previous active left off as much as possible depending on the RM restart feature. A new attempt is spawned for each managed application previously submitted to the RM. Applications can checkpoint periodically to avoid losing any work. The state-store must be visible from both the Active and Standby RMs. Currently, there are two RMStateStore implementations for persistence - FileSystemRMStateStore and ZKRMStateStore. The `ZKRMStateStore` implicitly allows write access to a single RM at any point in time, and hence is the recommended store to use in an HA cluster. When using the ZKRMStateStore, there is no need for a separate fencing mechanism to address a potential split-brain situation where multiple RMs can potentially assume the Active role.
Deployment
----------
### Configurations
Most of the failover functionality is tunable using various configuration properties. The following is a list of the required and important ones; [yarn-default.xml](../hadoop-yarn-common/yarn-default.xml) carries the full list of knobs, including default values. See also the document for [ResourceManager Restart](./ResourceManagerRestart.html) for instructions on setting up the state-store.
| Configuration Properties | Description |
|:---- |:---- |
| `yarn.resourcemanager.zk-address` | Address of the ZK-quorum. Used both for the state-store and embedded leader-election. |
| `yarn.resourcemanager.ha.enabled` | Enable RM HA. |
| `yarn.resourcemanager.ha.rm-ids` | List of logical IDs for the RMs. e.g., "rm1,rm2". |
| `yarn.resourcemanager.hostname.*rm-id*` | For each *rm-id*, specify the hostname the RM corresponds to. Alternately, one could set each of the RM's service addresses. |
| `yarn.resourcemanager.ha.id` | Identifies the RM in the ensemble. This is optional; however, if set, admins have to ensure that all the RMs have their own IDs in the config. |
| `yarn.resourcemanager.ha.automatic-failover.enabled` | Enable automatic failover. By default, it is enabled only when HA is enabled. |
| `yarn.resourcemanager.ha.automatic-failover.embedded` | Use embedded leader-elector to pick the Active RM, when automatic failover is enabled. By default, it is enabled only when HA is enabled. |
| `yarn.resourcemanager.cluster-id` | Identifies the cluster. Used by the elector to ensure an RM doesn't take over as Active for another cluster. |
| `yarn.client.failover-proxy-provider` | The class to be used by Clients, AMs and NMs to failover to the Active RM. |
| `yarn.client.failover-max-attempts` | The max number of times FailoverProxyProvider should attempt failover. |
| `yarn.client.failover-sleep-base-ms` | The sleep base (in milliseconds) to be used for calculating the exponential delay between failovers. |
| `yarn.client.failover-sleep-max-ms` | The maximum sleep time (in milliseconds) between failovers. |
| `yarn.client.failover-retries` | The number of retries per attempt to connect to a ResourceManager. |
| `yarn.client.failover-retries-on-socket-timeouts` | The number of retries per attempt to connect to a ResourceManager on socket timeouts. |
#### Sample configurations
Here is a sample of a minimal setup for RM failover.
```xml
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>cluster1</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>master1</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>master2</value>
</property>
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
```
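If you choose to set the optional `yarn.resourcemanager.ha.id`, each RM needs its own value. As an illustration only, the `yarn-site.xml` on the host `master1` from the sample above might carry the following, with `rm2` used on `master2` instead.

```xml
<property>
  <!-- Set per host: rm1 on master1, rm2 on master2 (IDs taken from the sample above). -->
  <name>yarn.resourcemanager.ha.id</name>
  <value>rm1</value>
</property>
```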
### Admin commands
`yarn rmadmin` has a few HA-specific command options to check the health/state of an RM, and to transition it to Active/Standby. HA commands take as their argument the service ID of an RM, as set by `yarn.resourcemanager.ha.rm-ids`.
    $ yarn rmadmin -getServiceState rm1
    active
    $ yarn rmadmin -getServiceState rm2
    standby
If automatic failover is enabled, you cannot use the manual transition commands. You can override this with the `--forcemanual` flag, but do so with caution.
    $ yarn rmadmin -transitionToStandby rm1
    Automatic failover is enabled for org.apache.hadoop.yarn.client.RMHAServiceTarget@1d8299fd
    Refusing to manually manage HA state, since it may cause
    a split-brain scenario or other incorrect state.
    If you are very sure you know what you are doing, please
    specify the forcemanual flag.
See [YarnCommands](./YarnCommands.html) for more details.
### ResourceManager Web UI services
Assuming a standby RM is up and running, the Standby automatically redirects all web requests to the Active, except for the "About" page.
### Web Services
Assuming a standby RM is up and running, RM web-services described at [ResourceManager REST APIs](./ResourceManagerRest.html) when invoked on a standby RM are automatically redirected to the Active RM.


@ -0,0 +1,181 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
ResourceManager Restart
=======================
* [Overview](#Overview)
* [Feature](#Feature)
* [Configurations](#Configurations)
* [Enable RM Restart](#Enable_RM_Restart)
* [Configure the state-store for persisting the RM state](#Configure_the_state-store_for_persisting_the_RM_state)
* [How to choose the state-store implementation](#How_to_choose_the_state-store_implementation)
* [Configurations for Hadoop FileSystem based state-store implementation](#Configurations_for_Hadoop_FileSystem_based_state-store_implementation)
* [Configurations for ZooKeeper based state-store implementation](#Configurations_for_ZooKeeper_based_state-store_implementation)
* [Configurations for LevelDB based state-store implementation](#Configurations_for_LevelDB_based_state-store_implementation)
* [Configurations for work-preserving RM recovery](#Configurations_for_work-preserving_RM_recovery)
* [Notes](#Notes)
* [Sample Configurations](#Sample_Configurations)
Overview
--------
ResourceManager is the central authority that manages resources and schedules applications running on YARN. Hence, it is potentially a single point of failure in an Apache YARN cluster.
This document gives an overview of ResourceManager Restart, a feature that enhances ResourceManager to keep functioning across restarts and also makes ResourceManager down-time invisible to end-users.
ResourceManager Restart feature is divided into two phases:
* **ResourceManager Restart Phase 1 (Non-work-preserving RM restart)**: Enhance RM to persist application/attempt state and other credentials information in a pluggable state-store. RM will reload this information from state-store upon restart and re-kick the previously running applications. Users are not required to re-submit the applications.
* **ResourceManager Restart Phase 2 (Work-preserving RM restart)**: Focus on re-constructing the running state of ResourceManager by combining the container statuses from NodeManagers and container requests from ApplicationMasters upon restart. The key difference from phase 1 is that previously running applications will not be killed after RM restarts, and so applications won't lose their work because of an RM outage.
Feature
-------
* **Phase 1: Non-work-preserving RM restart**
As of the Hadoop 2.4.0 release, only ResourceManager Restart Phase 1 is implemented, which is described below.
The overall concept is that RM persists the application metadata (i.e. ApplicationSubmissionContext) in a pluggable state-store when a client submits an application, and also saves the final status of the application, such as the completion state (failed, killed, finished) and diagnostics, when the application completes. In addition, RM saves credentials such as security keys and tokens so that it can work in a secure environment. Whenever RM shuts down, as long as the required information (i.e. the application metadata and, if running in a secure environment, the accompanying credentials) is available in the state-store, RM can pick up the application metadata from the state-store on restart and re-submit the application. RM won't re-submit applications that were already completed (i.e. failed, killed, finished) before RM went down.
NodeManagers and clients keep polling RM during its down-time until it comes back up. When RM becomes alive, it sends a re-sync command via heartbeats to all the NodeManagers and ApplicationMasters it was talking to. As of the Hadoop 2.4.0 release, NodeManagers and ApplicationMasters handle this command as follows: NMs kill all their managed containers and re-register with RM. From the RM's perspective, these re-registered NodeManagers are similar to newly joining NMs. AMs (e.g. the MapReduce AM) are expected to shut down when they receive the re-sync command. After RM restarts, loads all the application metadata and credentials from the state-store, and populates them into memory, it creates a new attempt (i.e. ApplicationMaster) for each application that was not yet completed and re-kicks that application as usual. As described before, the previously running applications' work is lost in this manner, since they are essentially killed by RM via the re-sync command on restart.
* **Phase 2: Work-preserving RM restart**
As of Hadoop 2.6.0, the RM restart feature is further enhanced so that applications running on the YARN cluster are not killed when RM restarts.
Beyond all the groundwork that has been done in Phase 1 to ensure the persistency of application state and reload that state on recovery, Phase 2 primarily focuses on re-constructing the entire running state of the YARN cluster, the majority of which is the state of the central scheduler inside RM, which keeps track of all containers' life-cycle, applications' headroom and resource requests, queues' resource usage etc. In this way, RM doesn't need to kill the AM and re-run the application from scratch as is done in Phase 1. Applications can simply re-sync with RM and resume from where they left off.
RM recovers its running state by taking advantage of the container statuses sent from all NMs. An NM does not kill its containers when it re-syncs with the restarted RM. It continues managing the containers and sends the container statuses across to RM when it re-registers. RM reconstructs the container instances and the associated applications' scheduling status by absorbing these containers' information. In the meantime, the AM needs to re-send its outstanding resource requests to RM, because RM may lose the unfulfilled requests when it shuts down. Application writers using the AMRMClient library to communicate with RM do not need to worry about the AM re-sending resource requests to RM on re-sync, as this is automatically taken care of by the library itself.
Configurations
--------------
This section describes the configurations involved to enable RM Restart feature.
### Enable RM Restart
| Property | Description |
|:---- |:---- |
| `yarn.resourcemanager.recovery.enabled` | `true` |
### Configure the state-store for persisting the RM state
| Property | Description |
|:---- |:---- |
| `yarn.resourcemanager.store.class` | The class name of the state-store to be used for saving application/attempt state and the credentials. The available state-store implementations are `org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore`, a ZooKeeper based state-store implementation; `org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore`, a Hadoop FileSystem based state-store implementation (e.g. HDFS or local FS); and `org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore`, a LevelDB based state-store implementation. The default value is `org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore`. |
### How to choose the state-store implementation
* **ZooKeeper based state-store**: Users are free to pick any of the state-stores to set up RM restart, but must use the ZooKeeper based state-store to support RM HA. The reason is that only the ZooKeeper based state-store supports a fencing mechanism to avoid a split-brain situation where multiple RMs assume they are active and can edit the state-store at the same time.
* **FileSystem based state-store**: HDFS and local FS based state-store are supported. Fencing mechanism is not supported.
* **LevelDB based state-store**: LevelDB based state-store is considered more lightweight than HDFS and ZooKeeper based state-stores. LevelDB supports better atomic operations, fewer I/O ops per state update, and far fewer total files on the filesystem. Fencing mechanism is not supported.
### Configurations for Hadoop FileSystem based state-store implementation
Both HDFS and local FS based state-store implementations are supported. The type of file system to be used is determined by the scheme of the URI, e.g. `hdfs://localhost:9000/rmstore` uses HDFS as the storage and `file:///tmp/yarn/rmstore` uses local FS as the storage. If no scheme (`hdfs://` or `file://`) is specified in the URI, the type of storage to be used is determined by `fs.defaultFS` defined in `core-site.xml`.
* Configure the URI where the RM state will be saved in the Hadoop FileSystem state-store.
| Property | Description |
|:---- |:---- |
| `yarn.resourcemanager.fs.state-store.uri` | URI pointing to the location of the FileSystem path where RM state will be stored (e.g. hdfs://localhost:9000/rmstore). Default value is `${hadoop.tmp.dir}/yarn/system/rmstore`. If FileSystem name is not provided, `fs.default.name` specified in **conf/core-site.xml** will be used. |
* Configure the retry policy state-store client uses to connect with the Hadoop FileSystem.
| Property | Description |
|:---- |:---- |
| `yarn.resourcemanager.fs.state-store.retry-policy-spec` | Hadoop FileSystem client retry policy specification. Hadoop FileSystem client retry is always enabled. Specified in pairs of sleep-time and number-of-retries i.e. (t0, n0), (t1, n1), ..., the first n0 retries sleep t0 milliseconds on average, the following n1 retries sleep t1 milliseconds on average, and so on. Default value is (2000, 500) |
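Put together, a FileSystem based state-store setup might look like the sketch below. The URI is the example used in the text above and should be replaced with your own location; the retry spec is the documented default, written here without parentheses, which is an assumption about the literal syntax that is worth checking against `yarn-default.xml`.

```xml
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
</property>
<property>
  <!-- Example URI from the text above; point this at your own file system and path. -->
  <name>yarn.resourcemanager.fs.state-store.uri</name>
  <value>hdfs://localhost:9000/rmstore</value>
</property>
<property>
  <!-- (sleep-time in ms, number of retries): the documented default pair. -->
  <name>yarn.resourcemanager.fs.state-store.retry-policy-spec</name>
  <value>2000, 500</value>
</property>
```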
### Configurations for ZooKeeper based state-store implementation
* Configure the ZooKeeper server address and the root path where the RM state is stored.
| Property | Description |
|:---- |:---- |
| `yarn.resourcemanager.zk-address` | Comma separated list of Host:Port pairs. Each corresponds to a ZooKeeper server (e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002") to be used by the RM for storing RM state. |
| `yarn.resourcemanager.zk-state-store.parent-path` | The full path of the root znode where RM state will be stored. Default value is /rmstore. |
* Configure the retry policy state-store client uses to connect with the ZooKeeper server.
| Property | Description |
|:---- |:---- |
| `yarn.resourcemanager.zk-num-retries` | Number of times RM tries to connect to ZooKeeper server if the connection is lost. Default value is 500. |
| `yarn.resourcemanager.zk-retry-interval-ms` | The interval in milliseconds between retries when connecting to a ZooKeeper server. Default value is 2 seconds. |
| `yarn.resourcemanager.zk-timeout-ms` | ZooKeeper session timeout in milliseconds. This configuration is used by the ZooKeeper server to determine when the session expires. Session expiration happens when the server does not hear from the client (i.e. no heartbeat) within the session timeout period specified by this configuration. Default value is 10 seconds |
* Configure the ACLs to be used for setting permissions on ZooKeeper znodes.
| Property | Description |
|:---- |:---- |
| `yarn.resourcemanager.zk-acl` | ACLs to be used for setting permissions on ZooKeeper znodes. Default value is `world:anyone:rwcda` |
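The ZooKeeper related properties above can be combined as in the following sketch. Every value is either the example address or a default quoted in the tables above (with the two-second and ten-second defaults converted to milliseconds); treat them as placeholders to adjust rather than recommendations.

```xml
<property>
  <!-- Example quorum address from the table above. -->
  <name>yarn.resourcemanager.zk-address</name>
  <value>127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-state-store.parent-path</name>
  <value>/rmstore</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-num-retries</name>
  <value>500</value>
</property>
<property>
  <!-- 2 seconds, expressed in milliseconds. -->
  <name>yarn.resourcemanager.zk-retry-interval-ms</name>
  <value>2000</value>
</property>
<property>
  <!-- 10 seconds, expressed in milliseconds. -->
  <name>yarn.resourcemanager.zk-timeout-ms</name>
  <value>10000</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-acl</name>
  <value>world:anyone:rwcda</value>
</property>
```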
### Configurations for LevelDB based state-store implementation
| Property | Description |
|:---- |:---- |
| `yarn.resourcemanager.leveldb-state-store.path` | Local path where the RM state will be stored. Default value is `${hadoop.tmp.dir}/yarn/system/rmstore` |
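A LevelDB based setup might be sketched as follows. The directory `/var/lib/hadoop-yarn/rmstore` is a hypothetical local path chosen for illustration; any local directory writable by the RM user should serve.

```xml
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore</value>
</property>
<property>
  <!-- Hypothetical local path; the default is ${hadoop.tmp.dir}/yarn/system/rmstore. -->
  <name>yarn.resourcemanager.leveldb-state-store.path</name>
  <value>/var/lib/hadoop-yarn/rmstore</value>
</property>
```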
### Configurations for work-preserving RM recovery
| Property | Description |
|:---- |:---- |
| `yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms` | Set the amount of time RM waits before allocating new containers on RM work-preserving recovery. This wait period gives RM a chance to finish resyncing with the NMs in the cluster on recovery, before assigning new containers to applications. |
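For example, to have RM wait ten seconds before it starts allocating new containers after a work-preserving recovery, the property could be set as below. The value 10000 is illustrative only and is an assumption, not a documented recommendation; consult `yarn-default.xml` for the actual default.

```xml
<property>
  <!-- Illustrative value: 10 seconds in milliseconds. -->
  <name>yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms</name>
  <value>10000</value>
</property>
```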
Notes
-----
The ContainerId string format changes if RM restarts with work-preserving recovery enabled. It used to be in this format:
`Container_{clusterTimestamp}_{appId}_{attemptId}_{containerId}`, e.g. `Container_1410901177871_0001_01_000005`.
It is now changed to:
`Container_e{epoch}_{clusterTimestamp}_{appId}_{attemptId}_{containerId}`, e.g. `Container_e17_1410901177871_0001_01_000005`.
Here, the additional epoch number is a monotonically increasing integer which starts from 0 and is increased by 1 each time RM restarts. If epoch number is 0, it is omitted and the containerId string format stays the same as before.
Sample Configurations
---------------------
Below is a minimum set of configurations for enabling RM work-preserving restart using ZooKeeper based state store.
    <property>
      <description>Enable RM to recover state after starting. If true, then
      yarn.resourcemanager.store.class must be specified</description>
      <name>yarn.resourcemanager.recovery.enabled</name>
      <value>true</value>
    </property>
    <property>
      <description>The class to use as the persistent store.</description>
      <name>yarn.resourcemanager.store.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
    </property>
    <property>
      <description>Comma separated list of Host:Port pairs. Each corresponds to a ZooKeeper server
      (e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002") to be used by the RM for storing RM state.
      This must be supplied when using org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
      as the value for yarn.resourcemanager.store.class</description>
      <name>yarn.resourcemanager.zk-address</name>
      <value>127.0.0.1:2181</value>
    </property>


@ -0,0 +1,135 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
YARN Secure Containers
======================
* [Overview](#Overview)
Overview
--------
YARN containers in a secure cluster use the operating system facilities to offer execution isolation for containers. Secure containers execute under the credentials of the job user. The operating system enforces access restriction for the container. The container must run as the user that submitted the application.
Secure Containers work only in the context of secured YARN clusters.
### Container isolation requirements
The container executor must access the local files and directories needed by the container, such as jars, configuration files, log files, shared objects etc. Although it is launched by the NodeManager, the container should not have access to the NodeManager private files and configuration. Containers running applications submitted by different users should be isolated and unable to access each other's files and directories. Similar requirements apply to other non-file securable system objects like named pipes, critical sections, LPC queues, shared memory etc.
### Linux Secure Container Executor
In a Linux environment the secure container executor is the `LinuxContainerExecutor`. It uses an external program called the **container-executor** to launch the container. This program has the `setuid` access right flag set, which allows it to launch the container with the permissions of the YARN application user.
### Configuration
The configured directories for `yarn.nodemanager.local-dirs` and `yarn.nodemanager.log-dirs` must be owned by the configured NodeManager user (`yarn`) and group (`hadoop`). The permission set on these directories must be `drwxr-xr-x`.
The `container-executor` program must be owned by `root` and have the permission set `---sr-s---`.
To configure the `NodeManager` to use the `LinuxContainerExecutor` set the following in the **conf/yarn-site.xml**:
```xml
<property>
<name>yarn.nodemanager.container-executor.class</name>
<value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
<name>yarn.nodemanager.linux-container-executor.group</name>
<value>hadoop</value>
</property>
```
Additionally the LCE requires the `container-executor.cfg` file, which is read by the `container-executor` program.
```
yarn.nodemanager.linux-container-executor.group=#configured value of yarn.nodemanager.linux-container-executor.group
banned.users=#comma separated list of users who can not run applications
allowed.system.users=#comma separated list of allowed system users
min.user.id=1000#Prevent other super-users
```
### Windows Secure Container Executor (WSCE)
The Windows environment secure container executor is the `WindowsSecureContainerExecutor`. It uses the Windows S4U infrastructure to launch the container as the YARN application user. The WSCE requires the presence of the `hadoopwinutilsvc` service. This service is hosted by `%HADOOP_HOME%\bin\winutils.exe` started with the `service` command line argument. This service offers some privileged operations that require LocalSystem authority so that the NM is not required to run the entire JVM and all the NM code in an elevated context. The NM interacts with the `hadoopwinutilsvc` service by means of Local RPC (LRPC) via JNI calls to the RPC client hosted in `hadoop.dll`.
### Configuration
To configure the `NodeManager` to use the `WindowsSecureContainerExecutor` set the following in the **conf/yarn-site.xml**:
```xml
<property>
<name>yarn.nodemanager.container-executor.class</name>
<value>org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor</value>
</property>
<property>
<name>yarn.nodemanager.windows-secure-container-executor.group</name>
<value>yarn</value>
</property>
```
The hadoopwinutilsvc uses `%HADOOP_HOME%\etc\hadoop\wsce_site.xml` to configure access to the privileged operations.
```xml
<property>
<name>yarn.nodemanager.windows-secure-container-executor.impersonate.allowed</name>
<value>HadoopUsers</value>
</property>
<property>
<name>yarn.nodemanager.windows-secure-container-executor.impersonate.denied</name>
<value>HadoopServices,Administrators</value>
</property>
<property>
<name>yarn.nodemanager.windows-secure-container-executor.allowed</name>
<value>nodemanager</value>
</property>
<property>
<name>yarn.nodemanager.windows-secure-container-executor.local-dirs</name>
<value>nm-local-dir, nm-log-dirs</value>
</property>
<property>
<name>yarn.nodemanager.windows-secure-container-executor.job-name</name>
<value>nodemanager-job-name</value>
</property>
```
`yarn.nodemanager.windows-secure-container-executor.allowed` should contain the name of the service account running the nodemanager. This user will be allowed to access the hadoopwinutilsvc functions.
`yarn.nodemanager.windows-secure-container-executor.impersonate.allowed` should contain the users that are allowed to create containers in the cluster. These users will be allowed to be impersonated by hadoopwinutilsvc.
`yarn.nodemanager.windows-secure-container-executor.impersonate.denied` should contain users that are explicitly forbidden from creating containers. hadoopwinutilsvc will refuse to impersonate these users.
`yarn.nodemanager.windows-secure-container-executor.local-dirs` should contain the nodemanager local dirs. hadoopwinutilsvc will allow file operations only under these directories. This should contain the same values as `$yarn.nodemanager.local-dirs, $yarn.nodemanager.log-dirs`, but note that hadoopwinutilsvc XML configuration processing does not do substitutions, so the value must be the final value. All paths must be absolute and no environment variable substitution will be performed. The paths are compared using a LOCALE\_INVARIANT case-insensitive string comparison; the file path being validated must start with one of the paths listed in the local-dirs configuration. Use a comma as the path separator: `,`
`yarn.nodemanager.windows-secure-container-executor.job-name` should contain a Windows NT job name that all containers should be added to. This configuration is optional. If not set, the container is not added to a global NodeManager job. Normally this should be set to the job that the NM is assigned to, so that killing the NM also kills all containers. hadoopwinutilsvc will not attempt to create this job; the job must exist when the container is launched. If the value is set and the job does not exist, container launch will fail with error 2 `The system cannot find the file specified`. Note that this global NM job is not related to the container job, which always gets created for each container and is named after the container ID. This setting controls a global job that spans all containers and the parent NM, and as such it requires nested jobs. Nested jobs are available only from Windows 8 and Windows Server 2012 onwards.
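The requirement that `local-dirs` hold final, absolute values can be illustrated with the sketch below. The Windows paths are hypothetical; in a real deployment they would have to match, character for character, the fully resolved values of `yarn.nodemanager.local-dirs` and `yarn.nodemanager.log-dirs` used by the NodeManager.

```xml
<property>
  <!-- Hypothetical absolute paths, comma separated; no variable substitution is performed. -->
  <name>yarn.nodemanager.windows-secure-container-executor.local-dirs</name>
  <value>C:\hadoop\data\nm-local-dir,C:\hadoop\logs\nm-log-dir</value>
</property>
```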
#### Useful Links
* [Exploring S4U Kerberos Extensions in Windows Server 2003](http://msdn.microsoft.com/en-us/magazine/cc188757.aspx)
* [Nested Jobs](http://msdn.microsoft.com/en-us/library/windows/desktop/hh448388.aspx)
* [Winutils needs ability to create task as domain user](https://issues.apache.org/jira/browse/YARN-1063)
* [Implement secure Windows Container Executor](https://issues.apache.org/jira/browse/YARN-1972)
* [Remove the need to run NodeManager as privileged account for Windows Secure Container Executor](https://issues.apache.org/jira/browse/YARN-2198)


@ -0,0 +1,231 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
YARN Timeline Server
====================
* [Overview](#Overview)
* [Current Status](#Current_Status)
* [Basic Configuration](#Basic_Configuration)
* [Advanced Configuration](#Advanced_Configuration)
* [Generic-data related Configuration](#Generic-data_related_Configuration)
* [Per-framework-data related Configuration](#Per-framework-date_related_Configuration)
* [Running Timeline server](#Running_Timeline_server)
* [Accessing generic-data via command-line](#Accessing_generic-data_via_command-line)
* [Publishing of per-framework data by applications](#Publishing_of_per-framework_data_by_applications)
Overview
--------
Storage and retrieval of applications' current as well as historic information in a generic fashion is solved in YARN through the Timeline Server (previously also called Generic Application History Server). This serves two responsibilities:
* Generic information about completed applications
Generic information includes application level data like queue-name, user information etc in the ApplicationSubmissionContext, list of application-attempts that ran for an application, information about each application-attempt, list of containers run under each application-attempt, and information about each container. Generic data is stored by ResourceManager to a history-store (default implementation on a file-system) and used by the web-UI to display information about completed applications.
* Per-framework information of running and completed applications
Per-framework information is completely specific to an application or framework. For example, Hadoop MapReduce framework can include pieces of information like number of map tasks, reduce tasks, counters etc. Application developers can publish the specific information to the Timeline server via TimelineClient from within a client, the ApplicationMaster and/or the application's containers. This information is then queryable via REST APIs for rendering by application/framework specific UIs.
Current Status
--------------
Timeline server is a work in progress. The basic storage and retrieval of information, both generic and framework specific, are in place. Timeline server doesn't work in secure mode yet. The generic information and the per-framework information are today collected and presented separately and thus are not integrated well together. Finally, the per-framework information is only available via RESTful APIs, using JSON type content; the ability to install framework specific UIs in YARN isn't supported yet.
Basic Configuration
-------------------
Users need to configure the Timeline server before starting it. The simplest configuration you should add in `yarn-site.xml` is to set the hostname of the Timeline server.
```xml
<property>
<description>The hostname of the Timeline service web application.</description>
<name>yarn.timeline-service.hostname</name>
<value>0.0.0.0</value>
</property>
```
Advanced Configuration
----------------------
In addition to the hostname, admins can also configure whether the service is enabled or not, the ports of the RPC and the web interfaces, and the number of RPC handler threads.
```xml
<property>
<description>Address for the Timeline server to start the RPC server.</description>
<name>yarn.timeline-service.address</name>
<value>${yarn.timeline-service.hostname}:10200</value>
</property>
<property>
<description>The http address of the Timeline service web application.</description>
<name>yarn.timeline-service.webapp.address</name>
<value>${yarn.timeline-service.hostname}:8188</value>
</property>
<property>
<description>The https address of the Timeline service web application.</description>
<name>yarn.timeline-service.webapp.https.address</name>
<value>${yarn.timeline-service.hostname}:8190</value>
</property>
<property>
<description>Handler thread count to serve the client RPC requests.</description>
<name>yarn.timeline-service.handler-thread-count</name>
<value>10</value>
</property>
<property>
<description>Enables cross-origin support (CORS) for web services where
cross-origin web response headers are needed. For example, javascript making
a web services request to the timeline server.</description>
<name>yarn.timeline-service.http-cross-origin.enabled</name>
<value>false</value>
</property>
<property>
<description>Comma separated list of origins that are allowed for web
services needing cross-origin (CORS) support. Wildcards (*) and patterns
allowed</description>
<name>yarn.timeline-service.http-cross-origin.allowed-origins</name>
<value>*</value>
</property>
<property>
<description>Comma separated list of methods that are allowed for web
services needing cross-origin (CORS) support.</description>
<name>yarn.timeline-service.http-cross-origin.allowed-methods</name>
<value>GET,POST,HEAD</value>
</property>
<property>
<description>Comma separated list of headers that are allowed for web
services needing cross-origin (CORS) support.</description>
<name>yarn.timeline-service.http-cross-origin.allowed-headers</name>
<value>X-Requested-With,Content-Type,Accept,Origin</value>
</property>
<property>
<description>The number of seconds a pre-flighted request can be cached
for web services needing cross-origin (CORS) support.</description>
<name>yarn.timeline-service.http-cross-origin.max-age</name>
<value>1800</value>
</property>
```
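As a sketch of a narrower CORS setup than the wildcard shown above, the fragment below enables cross-origin support and allows a single origin. The origin `http://ui.example.com` is a hypothetical host used only for illustration.

```xml
<property>
  <name>yarn.timeline-service.http-cross-origin.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Hypothetical origin; list the hosts that serve your own UI, comma separated. -->
  <name>yarn.timeline-service.http-cross-origin.allowed-origins</name>
  <value>http://ui.example.com</value>
</property>
```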
Generic-data related Configuration
----------------------------------
Users can specify whether the generic data collection is enabled or not, and also choose the storage-implementation class for the generic data. There are more configurations related to generic data collection, and users can refer to `yarn-default.xml` for all of them.
```xml
<property>
<description>Indicate to ResourceManager as well as clients whether
history-service is enabled or not. If enabled, ResourceManager starts
recording historical data that Timeline service can consume. Similarly,
clients can redirect to the history service when applications
finish if this is enabled.</description>
<name>yarn.timeline-service.generic-application-history.enabled</name>
<value>false</value>
</property>
<property>
<description>Store class name for history store, defaulting to file system
store</description>
<name>yarn.timeline-service.generic-application-history.store-class</name>
<value>org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore</value>
</property>
```
Per-framework-data related Configuration
----------------------------------------
Users can specify whether per-framework data service is enabled or not, choose the store implementation for the per-framework data, and tune the retention of the per-framework data. There are more configurations related to per-framework data service, and users can refer to `yarn-default.xml` for all of them.
```xml
<property>
<description>Indicate to clients whether Timeline service is enabled or not.
If enabled, the TimelineClient library used by end-users will post entities
and events to the Timeline server.</description>
<name>yarn.timeline-service.enabled</name>
<value>true</value>
</property>
<property>
<description>Store class name for timeline store.</description>
<name>yarn.timeline-service.store-class</name>
<value>org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore</value>
</property>
<property>
<description>Enable age off of timeline store data.</description>
<name>yarn.timeline-service.ttl-enable</name>
<value>true</value>
</property>
<property>
<description>Time to live for timeline store data in milliseconds.</description>
<name>yarn.timeline-service.ttl-ms</name>
<value>604800000</value>
</property>
```
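As a quick sanity check on the `yarn.timeline-service.ttl-ms` value shown above, 604800000 milliseconds corresponds to seven days. The following is a minimal sketch of that arithmetic; the class and method names are illustrative assumptions and are not part of Hadoop.

```java
import java.time.Duration;

// Illustrative helper (not part of Hadoop): converts a retention period in
// days into the millisecond value expected by yarn.timeline-service.ttl-ms.
class TimelineTtl {
    static long ttlMillis(long days) {
        return Duration.ofDays(days).toMillis();
    }

    public static void main(String[] args) {
        // 7 days * 24 h * 3600 s * 1000 ms
        System.out.println(ttlMillis(7)); // prints 604800000
    }
}
```

The same helper can be used to derive a value for a different retention period before placing it in the configuration file.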
Running Timeline server
-----------------------
Assuming all the aforementioned configurations are set properly, admins can start the Timeline server/history service with the following command:
    $ yarn timelineserver
Or users can start the Timeline server / history service as a daemon:
    $ yarn --daemon start timelineserver
Accessing generic-data via command-line
---------------------------------------
Users can access applications' generic historic data via the command line as below. Note that the same commands are usable to obtain the corresponding information about running applications.
```
$ yarn application -status <Application ID>
$ yarn applicationattempt -list <Application ID>
$ yarn applicationattempt -status <Application Attempt ID>
$ yarn container -list <Application Attempt ID>
$ yarn container -status <Container ID>
```
Publishing of per-framework data by applications
------------------------------------------------
Developers can define what information they want to record for their applications by composing `TimelineEntity` and `TimelineEvent` objects, and then putting the entities and events to the Timeline server via `TimelineClient`. The following is an example:
```java
// Create and start the Timeline client
TimelineClient client = TimelineClient.createTimelineClient();
client.init(conf);
client.start();
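// Compose the entity (the type and id below are placeholder values)
TimelineEntity entity = new TimelineEntity();
entity.setEntityType("MY_APPLICATION");
entity.setEntityId("entity_1");
try {
  TimelinePutResponse response = client.putEntities(entity);
} catch (IOException e) {
  // Handle the exception
} catch (YarnException e) {
  // Handle the exception
}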
// Stop the Timeline client
client.stop();
```

View File

@ -0,0 +1,24 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Web Application Proxy
=====================
The Web Application Proxy is part of YARN. By default it runs as part of the Resource Manager (RM), but it can be configured to run in standalone mode. The purpose of the proxy is to reduce the possibility of web-based attacks through YARN.
In YARN the Application Master (AM) is responsible for providing a web UI and sending that link to the RM. This opens up a number of potential issues. The RM runs as a trusted user, and people visiting that web address will treat it, and the links it provides to them, as trusted, when in reality the AM is running as a non-trusted user, and the links it gives to the RM could point to anything, malicious or otherwise. The Web Application Proxy mitigates this risk by warning users who do not own the given application that they are connecting to an untrusted site.
In addition, the proxy tries to reduce the impact that a malicious AM could have on a user. It primarily does this by stripping out cookies from the user and replacing them with a single cookie providing the user name of the logged-in user. This is because most web-based authentication systems identify a user based on a cookie, so providing this cookie to an untrusted application opens up the potential for an exploit. If the cookie is designed properly that potential should be fairly minimal, but stripping it reduces that attack vector further. The current proxy implementation does nothing to prevent the AM from providing links to malicious external sites, nor does it do anything to prevent malicious JavaScript code from running. In fact, JavaScript can be used to get the cookies, so stripping the cookies from the request has minimal benefit at this time.
In the future we hope to address the attack vectors described above and make attaching to an AM's web UI safer.

View File

@ -1,119 +1,116 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
---
Hadoop YARN - Introduction to the web services REST API's.
---
---
${maven.build.timestamp}
http://www.apache.org/licenses/LICENSE-2.0
Hadoop YARN - Introduction to the web services REST API's.
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
%{toc|section=1|fromDepth=0}
Hadoop YARN - Introduction to the web services REST API's
==========================================================
* Overview
* [Overview](#Overview)
* [URI's](#URIs)
* [HTTP Requests](#HTTP_Requests)
* [Summary of HTTP operations](#Summary_of_HTTP_operations)
* [Security](#Security)
* [Headers Supported](#Headers_Supported)
* [HTTP Responses](#HTTP_Responses)
* [Compression](#Compression)
* [Response Formats](#Response_Formats)
* [Response Errors](#Response_Errors)
* [Response Examples](#Response_Examples)
* [Sample Usage](#Sample_Usage)
The Hadoop YARN web service REST APIs are a set of URI resources that give access to the cluster, nodes, applications, and application historical information. The URI resources are grouped into APIs based on the type of information returned. Some URI resources return collections while others return singletons.
* URI's
Overview
--------
The URIs for the REST-based Web services have the following syntax:
The Hadoop YARN web service REST APIs are a set of URI resources that give access to the cluster, nodes, applications, and application historical information. The URI resources are grouped into APIs based on the type of information returned. Some URI resources return collections while others return singletons.
------
http://{http address of service}/ws/{version}/{resourcepath}
------
The elements in this syntax are as follows:
------
{http address of service} - The http address of the service to get information about.
Currently supported are the ResourceManager, NodeManager,
MapReduce application master, and history server.
{version} - The version of the APIs. In this release, the version is v1.
{resourcepath} - A path that defines a singleton resource or a collection of resources.
------
* HTTP Requests
To invoke a REST API, your application calls an HTTP operation on the URI associated with a resource.
** Summary of HTTP operations
Currently only GET is supported. It retrieves information about the resource specified.
** Security
The web service REST API's go through the same security as the web ui. If your cluster administrators have filters enabled you must authenticate via the mechanism they specified.
** Headers Supported
-----
* Accept
* Accept-Encoding
URI's
-----
Currently the only fields used in the header is Accept and Accept-Encoding. Accept currently supports XML and JSON for the response type you accept. Accept-Encoding currently only supports gzip format and will return gzip compressed output if this is specified, otherwise output is uncompressed. All other header fields are ignored.
The URIs for the REST-based Web services have the following syntax:
* HTTP Responses
http://{http address of service}/ws/{version}/{resourcepath}
The next few sections describe some of the syntax and other details of the HTTP Responses of the web service REST APIs.
The elements in this syntax are as follows:
** Compression
{http address of service} - The http address of the service to get information about.
Currently supported are the ResourceManager, NodeManager,
MapReduce application master, and history server.
{version} - The version of the APIs. In this release, the version is v1.
{resourcepath} - A path that defines a singleton resource or a collection of resources.
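The URI syntax above can be sketched as a small helper that joins the three elements. This is only an illustration: the class and method names are assumptions and not part of Hadoop, and the host, port and application id are the sample values used elsewhere in this document.

```java
import java.net.URI;

// Illustrative helper (not part of Hadoop) that assembles a web services URI
// as http://{http address of service}/ws/{version}/{resourcepath}.
class YarnRestUri {
    static URI build(String httpAddress, String version, String resourcePath) {
        return URI.create("http://" + httpAddress + "/ws/" + version + "/" + resourcePath);
    }

    public static void main(String[] args) {
        // Sample values taken from the response examples in this document.
        System.out.println(build("rmhost.domain:8088", "v1",
            "cluster/app/application_1324057493980_0001"));
    }
}
```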
This release supports gzip compression if you specify gzip in the Accept-Encoding header of the HTTP request (Accept-Encoding: gzip).
HTTP Requests
-------------
** Response Formats
To invoke a REST API, your application calls an HTTP operation on the URI associated with a resource.
This release of the web service REST APIs supports responses in JSON and XML formats. JSON is the default. To set the response format, you can specify the format in the Accept header of the HTTP request.
### Summary of HTTP operations
As specified in HTTP Response Codes, the response body can contain the data that represents the resource or an error message. In the case of success, the response body is in the selected format, either JSON or XML. In the case of error, the response body is in either JSON or XML based on the format requested. The Content-Type header of the response contains the format requested. If the application requests an unsupported format, the response status code is 500.
Note that the order of the fields within response body is not specified and might change. Also, additional fields might be added to a response body. Therefore, your applications should use parsing routines that can extract data from a response body in any order.
Currently only GET is supported. It retrieves information about the resource specified.
** Response Errors
### Security
After calling an HTTP request, an application should check the response status code to verify success or detect an error. If the response status code indicates an error, the response body contains an error message. The first field is the exception type, currently only RemoteException is returned. The following table lists the items within the RemoteException error message:
The web service REST APIs go through the same security as the web UI. If your cluster administrators have filters enabled, you must authenticate via the mechanism they specified.
*---------------*--------------*-------------------------------*
|| Item || Data Type || Description |
*---------------+--------------+-------------------------------+
| exception | String | Exception type |
*---------------+--------------+-------------------------------+
| javaClassName | String | Java class name of exception |
*---------------+--------------+-------------------------------+
| message | String | Detailed message of exception |
*---------------+--------------+-------------------------------+
### Headers Supported
** Response Examples
* Accept
* Accept-Encoding
*** JSON response with single resource
Currently the only header fields used are `Accept` and `Accept-Encoding`. `Accept` supports XML and JSON as the response type. `Accept-Encoding` currently supports only the gzip format: if it is specified, the output is gzip-compressed; otherwise the output is uncompressed. All other header fields are ignored.
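As an illustration of the two supported headers, the sketch below builds a GET request that asks for a gzip-compressed JSON response. It assumes Java 11 or later for `java.net.http`; the class and method names are hypothetical, and the URL is the sample one used in this document. The request is only constructed here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustrative sketch (not part of Hadoop): a GET request carrying the two
// header fields the web services honor, Accept and Accept-Encoding.
class AcceptHeadersExample {
    static HttpRequest jsonGzipRequest(String uri) {
        return HttpRequest.newBuilder(URI.create(uri))
            .header("Accept", "application/json")   // response format: JSON
            .header("Accept-Encoding", "gzip")      // ask for gzip-compressed output
            .GET()
            .build();
    }

    public static void main(String[] args) {
        HttpRequest request = jsonGzipRequest(
            "http://rmhost.domain:8088/ws/v1/cluster/app/application_1324057493980_0001");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

Note that the JDK `HttpClient` does not decompress gzip bodies on its own, so a caller sending this request would need to wrap the response stream in a `java.util.zip.GZIPInputStream`.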
HTTP Request:
GET http://rmhost.domain:8088/ws/v1/cluster/app/application_1324057493980_0001
HTTP Responses
--------------
Response Status Line:
HTTP/1.1 200 OK
The next few sections describe some of the syntax and other details of the HTTP Responses of the web service REST APIs.
Response Header:
### Compression
+---+
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26)
+---+
This release supports gzip compression if you specify gzip in the Accept-Encoding header of the HTTP request (Accept-Encoding: gzip).
Response Body:
### Response Formats
+---+
This release of the web service REST APIs supports responses in JSON and XML formats. JSON is the default. To set the response format, you can specify the format in the Accept header of the HTTP request.
As specified in HTTP Response Codes, the response body can contain the data that represents the resource or an error message. In the case of success, the response body is in the selected format, either JSON or XML. In the case of error, the response body is in either JSON or XML based on the format requested. The Content-Type header of the response contains the format requested. If the application requests an unsupported format, the response status code is 500. Note that the order of the fields within the response body is not specified and might change. Also, additional fields might be added to a response body. Therefore, your applications should use parsing routines that can extract data from a response body in any order.
### Response Errors
After making an HTTP request, an application should check the response status code to verify success or detect an error. If the response status code indicates an error, the response body contains an error message. The first field is the exception type; currently only RemoteException is returned. The following table lists the items within the RemoteException error message:
| Item | Data Type | Description |
|:---- |:---- |:---- |
| exception | String | Exception type |
| javaClassName | String | Java class name of exception |
| message | String | Detailed message of exception |
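
Because field order in a response body is not guaranteed, the three items in the table should be looked up by name rather than by position. The sketch below shows the idea with a regular expression over a flat JSON error body. It is a deliberate simplification (it does not handle escaped quotes, and real code would normally use a JSON library); the class name, method name and the sample `exception` value are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch (not part of Hadoop): looks up a string field of the
// RemoteException error message by name, independent of field order.
class RemoteExceptionFields {
    // Returns the string value of the named field, or null if it is absent.
    static String field(String json, String name) {
        Pattern p = Pattern.compile(
            "\"" + Pattern.quote(name) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Sample body; the "exception" value is assumed for illustration.
        String body = "{\"RemoteException\":{"
            + "\"javaClassName\":\"org.apache.hadoop.yarn.webapp.NotFoundException\","
            + "\"exception\":\"NotFoundException\","
            + "\"message\":\"java.lang.Exception: app with id: "
            + "application_1324057493980_9999 not found\"}}";
        System.out.println(field(body, "exception"));
        System.out.println(field(body, "javaClassName"));
        System.out.println(field(body, "message"));
    }
}
```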
### Response Examples
#### JSON response with single resource
HTTP Request: GET http://rmhost.domain:8088/ws/v1/cluster/app/application\_1324057493980\_0001
Response Status Line: HTTP/1.1 200 OK
Response Header:
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26)
Response Body:
```json
{
"app":
{
@ -134,30 +131,26 @@ Note that the order of the fields within response body is not specified and migh
"amHostHttpAddress":"amNM:2"
}
}
+---+
```
*** JSON response with Error response
#### JSON response with Error response
Here we request information about an application that doesn't exist yet.
Here we request information about an application that doesn't exist yet.
HTTP Request:
GET http://rmhost.domain:8088/ws/v1/cluster/app/application_1324057493980_9999
HTTP Request: GET http://rmhost.domain:8088/ws/v1/cluster/app/application\_1324057493980\_9999
Response Status Line:
HTTP/1.1 404 Not Found
Response Status Line: HTTP/1.1 404 Not Found
Response Header:
Response Header:
+---+
HTTP/1.1 404 Not Found
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26)
+---+
HTTP/1.1 404 Not Found
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26)
Response Body:
Response Body:
+---+
```json
{
"RemoteException" : {
"javaClassName" : "org.apache.hadoop.yarn.webapp.NotFoundException",
@ -165,37 +158,32 @@ Note that the order of the fields within response body is not specified and migh
"message" : "java.lang.Exception: app with id: application_1324057493980_9999 not found"
}
}
+---+
```
* Example usage
Sample Usage
-------------
You can use any number of ways/languages to use the web services REST API's. This example uses the curl command line interface to do the REST GET calls.
You can call the web services REST APIs from any language or tool. This example uses the curl command line interface to make the REST GET calls.
In this example, a user submits a MapReduce application to the ResourceManager using a command like:
+---+
hadoop jar hadoop-mapreduce-test.jar sleep -Dmapred.job.queue.name=a1 -m 1 -r 1 -rt 1200000 -mt 20
+---+
In this example, a user submits a MapReduce application to the ResourceManager using a command like:
The client prints information about the job submitted along with the application id, similar to:
hadoop jar hadoop-mapreduce-test.jar sleep -Dmapred.job.queue.name=a1 -m 1 -r 1 -rt 1200000 -mt 20
+---+
12/01/18 04:25:15 INFO mapred.ResourceMgrDelegate: Submitted application application_1326821518301_0010 to ResourceManager at host.domain.com/10.10.10.10:8032
12/01/18 04:25:15 INFO mapreduce.Job: Running job: job_1326821518301_0010
12/01/18 04:25:21 INFO mapred.ClientServiceDelegate: The url to track the job: host.domain.com:8088/proxy/application_1326821518301_0010/
12/01/18 04:25:22 INFO mapreduce.Job: Job job_1326821518301_0010 running in uber mode : false
12/01/18 04:25:22 INFO mapreduce.Job: map 0% reduce 0%
+---+
The client prints information about the job submitted along with the application id, similar to:
The user then wishes to track the application. The user starts by getting the information about the application from the ResourceManager. Use the --compressed option to request compressed output; curl handles decompression on the client side.
12/01/18 04:25:15 INFO mapred.ResourceMgrDelegate: Submitted application application_1326821518301_0010 to ResourceManager at host.domain.com/10.10.10.10:8032
12/01/18 04:25:15 INFO mapreduce.Job: Running job: job_1326821518301_0010
12/01/18 04:25:21 INFO mapred.ClientServiceDelegate: The url to track the job: host.domain.com:8088/proxy/application_1326821518301_0010/
12/01/18 04:25:22 INFO mapreduce.Job: Job job_1326821518301_0010 running in uber mode : false
12/01/18 04:25:22 INFO mapreduce.Job: map 0% reduce 0%
+---+
curl --compressed -H "Accept: application/json" -X GET "http://host.domain.com:8088/ws/v1/cluster/apps/application_1326821518301_0010"
+---+
The user then wishes to track the application. The user starts by getting the information about the application from the ResourceManager. Use the --compressed option to request compressed output; curl handles decompression on the client side.
Output:
curl --compressed -H "Accept: application/json" -X GET "http://host.domain.com:8088/ws/v1/cluster/apps/application_1326821518301_0010"
+---+
Output:
```json
{
"app" : {
"finishedTime" : 0,
@ -216,17 +204,15 @@ curl --compressed -H "Accept: application/json" -X GET "http://host.domain.com:8
"queue" : "a1"
}
}
+---+
```
The user then wishes to get more details about the running application and goes directly to the MapReduce application master for this application. The ResourceManager lists the trackingUrl that can be used for this application: http://host.domain.com:8088/proxy/application_1326821518301_0010. This could either go to the web browser or use the web service REST API's. The user uses the web services REST API's to get the list of jobs this MapReduce application master is running:
The user then wishes to get more details about the running application and goes directly to the MapReduce application master for this application. The ResourceManager lists the trackingUrl that can be used for this application: http://host.domain.com:8088/proxy/application\_1326821518301\_0010. This URL can either be opened in a web browser or used with the web service REST APIs. The user uses the web services REST APIs to get the list of jobs this MapReduce application master is running:
+---+
curl --compressed -H "Accept: application/json" -X GET "http://host.domain.com:8088/proxy/application_1326821518301_0010/ws/v1/mapreduce/jobs"
+---+
curl --compressed -H "Accept: application/json" -X GET "http://host.domain.com:8088/proxy/application_1326821518301_0010/ws/v1/mapreduce/jobs"
Output:
Output:
+---+
```json
{
"jobs" : {
"job" : [
@ -274,17 +260,15 @@ curl --compressed -H "Accept: application/json" -X GET "http://host.domain.com:8
]
}
}
+---+
```
The user then wishes to get the task details about the job with job id job_1326821518301_10_10 that was listed above.
The user then wishes to get the task details about the job with job id job\_1326821518301\_10\_10 that was listed above.
+---+
curl --compressed -H "Accept: application/json" -X GET "http://host.domain.com:8088/proxy/application_1326821518301_0010/ws/v1/mapreduce/jobs/job_1326821518301_10_10/tasks"
+---+
curl --compressed -H "Accept: application/json" -X GET "http://host.domain.com:8088/proxy/application_1326821518301_0010/ws/v1/mapreduce/jobs/job_1326821518301_10_10/tasks"
Output:
Output:
+---+
```json
{
"tasks" : {
"task" : [
@ -311,17 +295,15 @@ curl --compressed -H "Accept: application/json" -X GET "http://host.domain.com:8
]
}
}
+---+
```
The map task has finished but the reduce task is still running. The user wishes to get the task attempt information for the reduce task task_1326821518301_10_10_r_0, note that the Accept header isn't really required here since JSON is the default output format:
The map task has finished but the reduce task is still running. The user wishes to get the task attempt information for the reduce task task\_1326821518301\_10\_10\_r\_0 (note that the Accept header isn't really required here, since JSON is the default output format):
+---+
curl --compressed -X GET "http://host.domain.com:8088/proxy/application_1326821518301_0010/ws/v1/mapreduce/jobs/job_1326821518301_10_10/tasks/task_1326821518301_10_10_r_0/attempts"
+---+
curl --compressed -X GET "http://host.domain.com:8088/proxy/application_1326821518301_0010/ws/v1/mapreduce/jobs/job_1326821518301_10_10/tasks/task_1326821518301_10_10_r_0/attempts"
Output:
Output:
+---+
```json
{
"taskAttempts" : {
"taskAttempt" : [
@ -345,17 +327,15 @@ curl --compressed -H "Accept: application/json" -X GET "http://host.domain.com:8
]
}
}
+---+
```
The reduce attempt is still running and the user wishes to see the current counter values for that attempt:
The reduce attempt is still running and the user wishes to see the current counter values for that attempt:
+---+
curl --compressed -H "Accept: application/json" -X GET "http://host.domain.com:8088/proxy/application_1326821518301_0010/ws/v1/mapreduce/jobs/job_1326821518301_10_10/tasks/task_1326821518301_10_10_r_0/attempts/attempt_1326821518301_10_10_r_0_0/counters"
+---+
curl --compressed -H "Accept: application/json" -X GET "http://host.domain.com:8088/proxy/application_1326821518301_0010/ws/v1/mapreduce/jobs/job_1326821518301_10_10/tasks/task_1326821518301_10_10_r_0/attempts/attempt_1326821518301_10_10_r_0_0/counters"
Output:
Output:
+---+
```json
{
"JobTaskAttemptCounters" : {
"taskAttemptCounterGroup" : [
@ -511,17 +491,15 @@ curl --compressed -H "Accept: application/json" -X GET "http://host.domain.com:8
"id" : "attempt_1326821518301_10_10_r_0_0"
}
}
+---+
```
The job finishes and the user wishes to get the final job information from the history server for this job.
The job finishes and the user wishes to get the final job information from the history server for this job.
+---+
curl --compressed -X GET "http://host.domain.com:19888/ws/v1/history/mapreduce/jobs/job_1326821518301_10_10"
+---+
curl --compressed -X GET "http://host.domain.com:19888/ws/v1/history/mapreduce/jobs/job_1326821518301_10_10"
Output:
Output:
+---+
```json
{
"job" : {
"avgReduceTime" : 1250784,
@ -559,17 +537,15 @@ curl --compressed -H "Accept: application/json" -X GET "http://host.domain.com:8
"finishTime" : 1326861986164
}
}
+---+
```
The user also gets the final applications information from the ResourceManager.
The user also gets the final application information from the ResourceManager.
+---+
curl --compressed -H "Accept: application/json" -X GET "http://host.domain.com:8088/ws/v1/cluster/apps/application_1326821518301_0010"
+---+
curl --compressed -H "Accept: application/json" -X GET "http://host.domain.com:8088/ws/v1/cluster/apps/application_1326821518301_0010"
Output:
Output:
+---+
```json
{
"app" : {
"finishedTime" : 1326861991282,
@ -590,4 +566,4 @@ curl --compressed -H "Accept: application/json" -X GET "http://host.domain.com:8
"queue" : "a1"
}
}
+---+
```

View File

@ -0,0 +1,591 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Hadoop: Writing YARN Applications
=================================
* [Purpose](#Purpose)
* [Concepts and Flow](#Concepts_and_Flow)
* [Interfaces](#Interfaces)
* [Writing a Simple Yarn Application](#Writing_a_Simple_Yarn_Application)
* [Writing a simple Client](#Writing_a_simple_Client)
* [Writing an ApplicationMaster (AM)](#Writing_an_ApplicationMaster_AM)
* [FAQ](#FAQ)
* [How can I distribute my application's jars to all of the nodes in the YARN cluster that need it?](#How_can_I_distribute_my_applications_jars_to_all_of_the_nodes_in_the_YARN_cluster_that_need_it)
* [How do I get the ApplicationMaster's ApplicationAttemptId?](#How_do_I_get_the_ApplicationMasters_ApplicationAttemptId)
* [Why my container is killed by the NodeManager?](#Why_my_container_is_killed_by_the_NodeManager)
* [How do I include native libraries?](#How_do_I_include_native_libraries)
* [Useful Links](#Useful_Links)
* [Sample Code](#Sample_Code)
Purpose
-------
This document describes, at a high level, the way to implement new Applications for YARN.
Concepts and Flow
-----------------
The general concept is that an *application submission client* submits an *application* to the YARN *ResourceManager* (RM). This can be done through setting up a `YarnClient` object. After `YarnClient` is started, the client can then set up application context, prepare the very first container of the application that contains the *ApplicationMaster* (AM), and then submit the application. You need to provide information such as the details about the local files/jars that need to be available for your application to run, the actual command that needs to be executed (with the necessary command line arguments), any OS environment settings (optional), etc. Effectively, you need to describe the Unix process(es) that needs to be launched for your ApplicationMaster.
The YARN ResourceManager will then launch the ApplicationMaster (as specified) on an allocated container. The ApplicationMaster communicates with the YARN cluster and handles application execution. It performs operations in an asynchronous fashion. During application launch time, the main tasks of the ApplicationMaster are: a) communicating with the ResourceManager to negotiate and allocate resources for future containers, and b) after container allocation, communicating with YARN *NodeManager*s (NMs) to launch application containers on them. Task a) can be performed asynchronously through an `AMRMClientAsync` object, with event handling methods specified in an `AMRMClientAsync.CallbackHandler` type of event handler. The event handler needs to be set to the client explicitly. Task b) can be performed by launching a runnable object that then launches containers when there are containers allocated. As part of launching this container, the AM has to specify the `ContainerLaunchContext` that has the launch information such as command line specification, environment, etc.
During the execution of an application, the ApplicationMaster communicates with NodeManagers through an `NMClientAsync` object. All container events are handled by the `NMClientAsync.CallbackHandler` associated with the `NMClientAsync`. A typical callback handler handles client start, stop, status update and error. The ApplicationMaster also reports execution progress to the ResourceManager by handling the `getProgress()` method of `AMRMClientAsync.CallbackHandler`.
Other than asynchronous clients, there are synchronous versions for certain workflows (`AMRMClient` and `NMClient`). The asynchronous clients are recommended because of their (subjectively) simpler usage, and this article will mainly cover the asynchronous clients. Please refer to `AMRMClient` and `NMClient` for more information on synchronous clients.
Interfaces
----------
Following are the important interfaces:
* **Client**\<-\->**ResourceManager**
By using `YarnClient` objects.
* **ApplicationMaster**\<-\->**ResourceManager**
By using `AMRMClientAsync` objects, handling events asynchronously by `AMRMClientAsync.CallbackHandler`
* **ApplicationMaster**\<-\->**NodeManager**
Launch containers. Communicate with NodeManagers by using `NMClientAsync` objects, handling container events by `NMClientAsync.CallbackHandler`
**Note**
* The three main protocols for YARN applications (ApplicationClientProtocol, ApplicationMasterProtocol and ContainerManagementProtocol) are still preserved. The three clients wrap these three protocols to provide a simpler programming model for YARN applications.
* Under very rare circumstances, a programmer may want to use the three protocols directly to implement an application. However, note that *such behaviors are no longer encouraged for general use cases*.
Writing a Simple Yarn Application
---------------------------------
### Writing a simple Client
* The first step that a client needs to do is to initialize and start a YarnClient.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();
* Once a client is set up, the client needs to create an application, and get its application id.
        YarnClientApplication app = yarnClient.createApplication();
        GetNewApplicationResponse appResponse = app.getNewApplicationResponse();
* The response from the `YarnClientApplication` for a new application also contains information about the cluster such as the minimum/maximum resource capabilities of the cluster. This is required to ensure that you can correctly set the specifications of the container in which the ApplicationMaster will be launched. Please refer to `GetNewApplicationResponse` for more details.
* The main crux of a client is to set up the `ApplicationSubmissionContext`, which defines all the information needed by the RM to launch the AM. A client needs to set the following into the context:
* Application info: id, name
* Queue, priority info: Queue to which the application will be submitted, the priority to be assigned for the application.
* User: The user submitting the application
* `ContainerLaunchContext`: The information defining the container in which the AM will be launched and run. The `ContainerLaunchContext`, as mentioned previously, defines all the required information needed to run the application such as the local **R**esources (binaries, jars, files etc.), **E**nvironment settings (CLASSPATH etc.), the **C**ommand to be executed and security **T**okens (*RECT*).
```java
// set the application submission context
ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
ApplicationId appId = appContext.getApplicationId();
appContext.setKeepContainersAcrossApplicationAttempts(keepContainers);
appContext.setApplicationName(appName);
// set local resources for the application master
// local files or archives as needed
// In this scenario, the jar file for the application master is part of the local resources
Map<String, LocalResource> localResources = new HashMap<String, LocalResource>();
LOG.info("Copy App Master jar from local filesystem and add to local environment");
// Copy the application master jar to the filesystem
// Create a local resource to point to the destination jar path
FileSystem fs = FileSystem.get(conf);
addToLocalResources(fs, appMasterJar, appMasterJarPath, appId.toString(),
localResources, null);
// Set the log4j properties if needed
if (!log4jPropFile.isEmpty()) {
addToLocalResources(fs, log4jPropFile, log4jPath, appId.toString(),
localResources, null);
}
// The shell script has to be made available on the final container(s)
// where it will be executed.
// To do this, we need to first copy into the filesystem that is visible
// to the yarn framework.
// We do not need to set this as a local resource for the application
// master as the application master does not need it.
String hdfsShellScriptLocation = "";
long hdfsShellScriptLen = 0;
long hdfsShellScriptTimestamp = 0;
if (!shellScriptPath.isEmpty()) {
Path shellSrc = new Path(shellScriptPath);
String shellPathSuffix =
appName + "/" + appId.toString() + "/" + SCRIPT_PATH;
Path shellDst =
new Path(fs.getHomeDirectory(), shellPathSuffix);
fs.copyFromLocalFile(false, true, shellSrc, shellDst);
hdfsShellScriptLocation = shellDst.toUri().toString();
FileStatus shellFileStatus = fs.getFileStatus(shellDst);
hdfsShellScriptLen = shellFileStatus.getLen();
hdfsShellScriptTimestamp = shellFileStatus.getModificationTime();
}
if (!shellCommand.isEmpty()) {
addToLocalResources(fs, null, shellCommandPath, appId.toString(),
localResources, shellCommand);
}
if (shellArgs.length > 0) {
addToLocalResources(fs, null, shellArgsPath, appId.toString(),
localResources, StringUtils.join(shellArgs, " "));
}
// Set the env variables to be setup in the env where the application master will be run
LOG.info("Set the environment for the application master");
Map<String, String> env = new HashMap<String, String>();
// put location of shell script into env
// using the env info, the application master will create the correct local resource for the
// eventual containers that will be launched to execute the shell scripts
env.put(DSConstants.DISTRIBUTEDSHELLSCRIPTLOCATION, hdfsShellScriptLocation);
env.put(DSConstants.DISTRIBUTEDSHELLSCRIPTTIMESTAMP, Long.toString(hdfsShellScriptTimestamp));
env.put(DSConstants.DISTRIBUTEDSHELLSCRIPTLEN, Long.toString(hdfsShellScriptLen));
// Add AppMaster.jar location to classpath
// At some point we should not be required to add
// the hadoop specific classpaths to the env.
// It should be provided out of the box.
// For now setting all required classpaths including
// the classpath to "." for the application jar
StringBuilder classPathEnv = new StringBuilder(Environment.CLASSPATH.$$())
.append(ApplicationConstants.CLASS_PATH_SEPARATOR).append("./*");
for (String c : conf.getStrings(
YarnConfiguration.YARN_APPLICATION_CLASSPATH,
YarnConfiguration.DEFAULT_YARN_CROSS_PLATFORM_APPLICATION_CLASSPATH)) {
classPathEnv.append(ApplicationConstants.CLASS_PATH_SEPARATOR);
classPathEnv.append(c.trim());
}
classPathEnv.append(ApplicationConstants.CLASS_PATH_SEPARATOR).append(
"./log4j.properties");
// Set the necessary command to execute the application master
Vector<CharSequence> vargs = new Vector<CharSequence>(30);
// Set java executable command
LOG.info("Setting up app master command");
vargs.add(Environment.JAVA_HOME.$$() + "/bin/java");
// Set Xmx based on am memory size
vargs.add("-Xmx" + amMemory + "m");
// Set class name
vargs.add(appMasterMainClass);
// Set params for Application Master
vargs.add("--container_memory " + String.valueOf(containerMemory));
vargs.add("--container_vcores " + String.valueOf(containerVirtualCores));
vargs.add("--num_containers " + String.valueOf(numContainers));
vargs.add("--priority " + String.valueOf(shellCmdPriority));
for (Map.Entry<String, String> entry : shellEnv.entrySet()) {
vargs.add("--shell_env " + entry.getKey() + "=" + entry.getValue());
}
if (debugFlag) {
vargs.add("--debug");
}
vargs.add("1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/AppMaster.stdout");
vargs.add("2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/AppMaster.stderr");
// Get final command
StringBuilder command = new StringBuilder();
for (CharSequence str : vargs) {
command.append(str).append(" ");
}
LOG.info("Completed setting up app master command " + command.toString());
List<String> commands = new ArrayList<String>();
commands.add(command.toString());
// Set up the container launch context for the application master
ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
localResources, env, commands, null, null, null);
// Set up resource type requirements
// For now, both memory and vcores are supported, so we set memory and
// vcores requirements
Resource capability = Resource.newInstance(amMemory, amVCores);
appContext.setResource(capability);
// Service data is a binary blob that can be passed to the application
// Not needed in this scenario
// amContainer.setServiceData(serviceData);
// Setup security tokens
if (UserGroupInformation.isSecurityEnabled()) {
// Note: Credentials class is marked as LimitedPrivate for HDFS and MapReduce
Credentials credentials = new Credentials();
String tokenRenewer = conf.get(YarnConfiguration.RM_PRINCIPAL);
if (tokenRenewer == null || tokenRenewer.length() == 0) {
throw new IOException(
"Can't get Master Kerberos principal for the RM to use as renewer");
}
// For now, only getting tokens for the default file-system.
final Token<?> tokens[] =
fs.addDelegationTokens(tokenRenewer, credentials);
if (tokens != null) {
for (Token<?> token : tokens) {
LOG.info("Got dt for " + fs.getUri() + "; " + token);
}
}
DataOutputBuffer dob = new DataOutputBuffer();
credentials.writeTokenStorageToStream(dob);
ByteBuffer fsTokens = ByteBuffer.wrap(dob.getData(), 0, dob.getLength());
amContainer.setTokens(fsTokens);
}
appContext.setAMContainerSpec(amContainer);
```
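The command-assembly step in the listing above can be isolated into a small, dependency-free sketch. The class name `AmCommandSketch` is hypothetical, and the two string constants are assumed stand-ins for `Environment.JAVA_HOME.$$()` and `ApplicationConstants.LOG_DIR_EXPANSION_VAR`, which the NodeManager is expected to expand at launch time.

```java
import java.util.ArrayList;
import java.util.List;

class AmCommandSketch {
  // Assumed stand-ins for Environment.JAVA_HOME.$$() and
  // ApplicationConstants.LOG_DIR_EXPANSION_VAR; not taken from Hadoop itself.
  static final String JAVA_HOME = "{{JAVA_HOME}}";
  static final String LOG_DIR = "<LOG_DIR>";

  /** Joins the java binary, heap size, main class, AM arguments and log redirects. */
  static String buildCommand(int amMemoryMb, String mainClass, List<String> amArgs) {
    List<String> parts = new ArrayList<>();
    parts.add(JAVA_HOME + "/bin/java");
    parts.add("-Xmx" + amMemoryMb + "m");
    parts.add(mainClass);
    parts.addAll(amArgs);
    // Redirect stdout and stderr into the container's log directory.
    parts.add("1>" + LOG_DIR + "/AppMaster.stdout");
    parts.add("2>" + LOG_DIR + "/AppMaster.stderr");
    return String.join(" ", parts);
  }
}
```

The resulting single string is what would be placed into the `commands` list handed to `ContainerLaunchContext.newInstance`.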
* After the setup process is complete, the client is ready to submit the application with specified priority and queue.
```java
// Set the priority for the application master
Priority pri = Priority.newInstance(amPriority);
appContext.setPriority(pri);
// Set the queue to which this application is to be submitted in the RM
appContext.setQueue(amQueue);
// Submit the application to the applications manager
// SubmitApplicationResponse submitResp = applicationsManager.submitApplication(appRequest);
yarnClient.submitApplication(appContext);
```
* At this point, the RM will have accepted the application and in the background, will go through the process of allocating a container with the required specifications and then eventually setting up and launching the AM on the allocated container.
* There are multiple ways a client can track progress of the actual task.
> * It can communicate with the RM and request a report of the application via the `getApplicationReport()` method of `YarnClient`.
```java
// Get application report for the appId we are interested in
ApplicationReport report = yarnClient.getApplicationReport(appId);
```
> The ApplicationReport received from the RM consists of the following:
>> * *General application information*: Application id, queue to which the application was submitted, user who submitted the application and the start time for the application.
>> * *ApplicationMaster details*: the host on which the AM is running, the rpc port (if any) on which it is listening for requests from clients and a token that the client needs to communicate with the AM.
>> * *Application tracking information*: If the application supports some form of progress tracking, it can set a tracking url which is available via `ApplicationReport`'s `getTrackingUrl()` method that a client can look at to monitor progress.
>> * *Application status*: The state of the application as seen by the ResourceManager is available via `ApplicationReport#getYarnApplicationState`. If the `YarnApplicationState` is set to `FINISHED`, the client should refer to `ApplicationReport#getFinalApplicationStatus` to check for the actual success/failure of the application task itself. In case of failures, `ApplicationReport#getDiagnostics` may be useful to shed some more light on the failure.
> * If the ApplicationMaster supports it, a client can directly query the AM itself for progress updates via the host:rpcport information obtained from the application report. It can also use the tracking url obtained from the report if available.
* In certain situations, if the application is taking too long or due to other factors, the client may wish to kill the application. `YarnClient` supports the `killApplication` call, which allows a client to send a kill signal to the AM via the ResourceManager. An ApplicationMaster, if so designed, may also support an abort call via its rpc layer that a client may be able to leverage.
yarnClient.killApplication(appId);
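
The two-level status check described for the application report (the RM-side state first, then the application's own final status) can be sketched without any Hadoop dependency. The enums below are local stand-ins that only mirror the names of `YarnApplicationState` and `FinalApplicationStatus`; the class `ReportOutcomeSketch` is hypothetical.

```java
class ReportOutcomeSketch {
  // Local stand-ins mirroring the value names of YarnApplicationState and
  // FinalApplicationStatus; these are not the Hadoop classes themselves.
  enum AppState { NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }
  enum FinalStatus { UNDEFINED, SUCCEEDED, FAILED, KILLED }

  /** True once the application has stopped, so further polling is pointless. */
  static boolean isTerminal(AppState state) {
    return state == AppState.FINISHED
        || state == AppState.FAILED
        || state == AppState.KILLED;
  }

  /** True only if the RM reports FINISHED and the task itself reported success. */
  static boolean succeeded(AppState state, FinalStatus finalStatus) {
    return state == AppState.FINISHED && finalStatus == FinalStatus.SUCCEEDED;
  }
}
```

A client would typically poll the report until `isTerminal` holds, then use `succeeded` to decide whether to read the diagnostics.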
### Writing an ApplicationMaster (AM)
* The AM is the actual owner of the job. It will be launched by the RM and, via the client, will be provided with all the necessary information and resources about the job that it has been tasked to oversee and complete.
* As the AM is launched within a container that may (likely will) be sharing a physical host with other containers, given the multi-tenancy nature, amongst other issues, it cannot make any assumptions about things like pre-configured ports that it can listen on.
* When the AM starts up, several parameters are made available to it via the environment. These include the `ContainerId` for the AM container, the application submission time and details about the NM (NodeManager) host running the ApplicationMaster. Ref `ApplicationConstants` for parameter names.
* All interactions with the RM require an `ApplicationAttemptId` (there can be multiple attempts per application in case of failures). The `ApplicationAttemptId` can be obtained from the AM's container id. There are helper APIs to convert the value obtained from the environment into objects.
```java
Map<String, String> envs = System.getenv();
String containerIdString =
envs.get(ApplicationConstants.AM_CONTAINER_ID_ENV);
if (containerIdString == null) {
// container id should always be set in the env by the framework
throw new IllegalArgumentException(
"ContainerId not set in the environment");
}
ContainerId containerId = ConverterUtils.toContainerId(containerIdString);
ApplicationAttemptId appAttemptID = containerId.getApplicationAttemptId();
```
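To make the relationship between the two identifiers concrete, here is a dependency-free sketch that derives an attempt id string from a container id string. It assumes the textual layout `container_[e<epoch>_]<clusterTimestamp>_<appId>_<attemptId>_<containerId>` and the `appattempt_` prefix; both are assumptions for illustration, and real code should rely on the helper APIs shown above rather than on this hypothetical `ContainerIdSketch` class.

```java
import java.util.Locale;

class ContainerIdSketch {
  /**
   * Derives the attempt id string from a container id string, assuming the layout
   * container_[e<epoch>_]<clusterTimestamp>_<appId>_<attemptId>_<containerId>.
   */
  static String toAttemptIdString(String containerId) {
    String[] p = containerId.split("_");
    if (p.length < 5 || !p[0].equals("container")) {
      throw new IllegalArgumentException("Invalid ContainerId: " + containerId);
    }
    // Skip the optional epoch field (for example "e17") if it is present.
    int base = p[1].startsWith("e") ? 2 : 1;
    if (p.length != base + 4) {
      throw new IllegalArgumentException("Invalid ContainerId: " + containerId);
    }
    long clusterTimestamp = Long.parseLong(p[base]);
    int appId = Integer.parseInt(p[base + 1]);
    int attemptId = Integer.parseInt(p[base + 2]);
    return String.format(Locale.ROOT, "appattempt_%d_%04d_%06d",
        clusterTimestamp, appId, attemptId);
  }
}
```

The point of the sketch is only that the attempt is fully determined by the container id, which is why the AM needs nothing beyond its own container id to identify itself to the RM.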
* After an AM has initialized itself completely, we can start the two clients: one to ResourceManager, and one to NodeManagers. We set them up with our customized event handler, and we will talk about those event handlers in detail later in this article.
```java
AMRMClientAsync.CallbackHandler allocListener = new RMCallbackHandler();
amRMClient = AMRMClientAsync.createAMRMClientAsync(1000, allocListener);
amRMClient.init(conf);
amRMClient.start();
containerListener = createNMCallbackHandler();
nmClientAsync = new NMClientAsyncImpl(containerListener);
nmClientAsync.init(conf);
nmClientAsync.start();
```
* The AM has to emit heartbeats to the RM to keep it informed that the AM is alive and still running. The timeout expiry interval at the RM is defined by a config setting accessible via `YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS` with the default being defined by `YarnConfiguration.DEFAULT_RM_AM_EXPIRY_INTERVAL_MS`. The ApplicationMaster needs to register itself with the ResourceManager to start heartbeating.
```java
// Register self with ResourceManager
// This will start heartbeating to the RM
appMasterHostname = NetUtils.getHostname();
RegisterApplicationMasterResponse response = amRMClient
.registerApplicationMaster(appMasterHostname, appMasterRpcPort,
appMasterTrackingUrl);
```
* The registration response includes the maximum resource capability of the cluster. You may want to use this to check the application's request.
```java
// Dump out information about cluster capability as seen by the
// resource manager
int maxMem = response.getMaximumResourceCapability().getMemory();
LOG.info("Max mem capability of resources in this cluster " + maxMem);
int maxVCores = response.getMaximumResourceCapability().getVirtualCores();
LOG.info("Max vcores capability of resources in this cluster " + maxVCores);
// A resource ask cannot exceed the max.
if (containerMemory > maxMem) {
LOG.info("Container memory specified above max threshold of cluster."
+ " Using max value." + ", specified=" + containerMemory + ", max="
+ maxMem);
containerMemory = maxMem;
}
if (containerVirtualCores > maxVCores) {
LOG.info("Container virtual cores specified above max threshold of cluster."
+ " Using max value." + ", specified=" + containerVirtualCores + ", max="
+ maxVCores);
containerVirtualCores = maxVCores;
}
List<Container> previousAMRunningContainers =
response.getContainersFromPreviousAttempts();
LOG.info("Received " + previousAMRunningContainers.size()
+ " previous AM's running containers on AM registration.");
```
* Based on the task requirements, the AM can ask for a set of containers to run its tasks on. We can now calculate how many containers we need, and request that many containers.
```java
List<Container> previousAMRunningContainers =
response.getContainersFromPreviousAttempts();
LOG.info("Received " + previousAMRunningContainers.size()
+ " previous AM's running containers on AM registration.");
int numTotalContainersToRequest =
numTotalContainers - previousAMRunningContainers.size();
// Setup ask for containers from RM
// Send request for containers to RM
// Until we get our fully allocated quota, we keep on polling RM for
// containers
// Keep looping until all the containers are launched and shell script
// executed on them ( regardless of success/failure).
for (int i = 0; i < numTotalContainersToRequest; ++i) {
ContainerRequest containerAsk = setupContainerAskForRM();
amRMClient.addContainerRequest(containerAsk);
}
```
* In `setupContainerAskForRM()`, the following two things need to be set up:
> * Resource capability: Currently, YARN supports memory-based resource requirements, so the request should define how much memory is needed. The value is defined in MB and has to be less than the max capability of the cluster and an exact multiple of the min capability. Memory resources correspond to physical memory limits imposed on the task containers. YARN also supports computation-based resources (vCores), as shown in the code.
> * Priority: When asking for sets of containers, an AM may define different priorities to each set. For example, the Map-Reduce AM may assign a higher priority to containers needed for the Map tasks and a lower priority for the Reduce tasks' containers.
```java
private ContainerRequest setupContainerAskForRM() {
// setup requirements for hosts
// using * as any host will do for the distributed shell app
// set the priority for the request
Priority pri = Priority.newInstance(requestPriority);
// Set up resource type requirements
// For now, memory and CPU are supported so we set memory and cpu requirements
Resource capability = Resource.newInstance(containerMemory,
containerVirtualCores);
ContainerRequest request = new ContainerRequest(capability, null, null,
pri);
LOG.info("Requested container ask: " + request.toString());
return request;
}
```
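The constraint on the memory value (a multiple of the minimum capability, not above the maximum) can be applied on the AM side before building the request. The following is a minimal sketch under that reading; `ResourceAskSketch` and its parameters are hypothetical, and the scheduler may apply its own normalization regardless.

```java
class ResourceAskSketch {
  /** Rounds a memory ask up to a multiple of minMb, then caps it at maxMb. */
  static int normalizeMemory(int requestedMb, int minMb, int maxMb) {
    if (minMb <= 0 || maxMb < minMb) {
      throw new IllegalArgumentException("need 0 < minMb <= maxMb");
    }
    // Never ask for less than one minimum-sized allocation.
    int wanted = Math.max(requestedMb, minMb);
    int roundedUp = ((wanted + minMb - 1) / minMb) * minMb;
    // Round the cap down so the result stays a multiple of minMb.
    int cap = (maxMb / minMb) * minMb;
    return Math.min(roundedUp, cap);
  }
}
```

For example, with a minimum of 1024 MB and a maximum of 8192 MB, an ask of 1500 MB becomes 2048 MB and an ask of 10000 MB becomes 8192 MB.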
* After container allocation requests have been sent by the ApplicationMaster, containers will be launched asynchronously by the event handler of the `AMRMClientAsync` client. The handler should implement the `AMRMClientAsync.CallbackHandler` interface.
> * When there are containers allocated, the handler sets up a thread that runs the code to launch containers. Here we use the name `LaunchContainerRunnable` to demonstrate. We will talk about the `LaunchContainerRunnable` class in the following part of this article.
```java
@Override
public void onContainersAllocated(List<Container> allocatedContainers) {
LOG.info("Got response from RM for container ask, allocatedCnt="
+ allocatedContainers.size());
numAllocatedContainers.addAndGet(allocatedContainers.size());
for (Container allocatedContainer : allocatedContainers) {
LaunchContainerRunnable runnableLaunchContainer =
new LaunchContainerRunnable(allocatedContainer, containerListener);
Thread launchThread = new Thread(runnableLaunchContainer);
// launch and start the container on a separate thread to keep
// the main thread unblocked
// as all containers may not be allocated at one go.
launchThreads.add(launchThread);
launchThread.start();
}
}
```
> * On each heartbeat, the event handler reports the progress of the application.
```java
@Override
public float getProgress() {
// set progress to deliver to RM on next heartbeat
float progress = (float) numCompletedContainers.get()
/ numTotalContainers;
return progress;
}
```
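The counters that the callbacks above read and update can be grouped into one small thread-safe object. This is a sketch only: `ContainerTrackerSketch` is a hypothetical class, and treating any non-zero exit status as a failure is an assumption made for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

class ContainerTrackerSketch {
  private final int numTotalContainers;
  private final AtomicInteger numCompleted = new AtomicInteger();
  private final AtomicInteger numFailed = new AtomicInteger();

  ContainerTrackerSketch(int numTotalContainers) {
    if (numTotalContainers <= 0) {
      throw new IllegalArgumentException("need at least one container");
    }
    this.numTotalContainers = numTotalContainers;
  }

  /** Records one finished container; a non-zero exit status counts as a failure. */
  void onContainerCompleted(int exitStatus) {
    numCompleted.incrementAndGet();
    if (exitStatus != 0) {
      numFailed.incrementAndGet();
    }
  }

  /** Fraction of containers completed, as reported to the RM on a heartbeat. */
  float getProgress() {
    return (float) numCompleted.get() / numTotalContainers;
  }

  boolean isDone() {
    return numCompleted.get() >= numTotalContainers;
  }

  /** True only when every container has completed and none of them failed. */
  boolean succeeded() {
    return isDone() && numFailed.get() == 0;
  }
}
```

Atomic counters are used because the completion callback and the progress callback may be invoked from different threads.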
* The container launch thread actually launches the containers on NMs. After a container has been allocated to the AM, it needs to follow a similar process that the client followed in setting up the `ContainerLaunchContext` for the eventual task that is going to be running on the allocated Container. Once the `ContainerLaunchContext` is defined, the AM can start it through the `NMClientAsync`.
```java
// Set the necessary command to execute on the allocated container
Vector<CharSequence> vargs = new Vector<CharSequence>(5);
// Set executable command
vargs.add(shellCommand);
// Set shell script path
if (!scriptPath.isEmpty()) {
vargs.add(Shell.WINDOWS ? ExecBatScripStringtPath
: ExecShellStringPath);
}
// Set args for the shell command if any
vargs.add(shellArgs);
// Add log redirect params
vargs.add("1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout");
vargs.add("2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr");
// Get final command
StringBuilder command = new StringBuilder();
for (CharSequence str : vargs) {
command.append(str).append(" ");
}
List<String> commands = new ArrayList<String>();
commands.add(command.toString());
// Set up ContainerLaunchContext, setting local resource, environment,
// command and token for constructor.
// Note for tokens: Set up tokens for the container too. Today, for normal
// shell commands, the container in distribute-shell doesn't need any
// tokens. We are populating them mainly for NodeManagers to be able to
// download any files in the distributed file-system. The tokens are
// otherwise also useful in cases, for e.g., when one is running a
// "hadoop dfs" command inside the distributed shell.
ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
localResources, shellEnv, commands, null, allTokens.duplicate(), null);
containerListener.addContainer(container.getId(), container);
nmClientAsync.startContainerAsync(container, ctx);
```
* The `NMClientAsync` object, together with its event handler, handles container events, including container start, stop, status update, and error occurrences.
* After the ApplicationMaster determines that the work is done, it needs to unregister itself through the AM-RM client and then stop the client.
```java
try {
amRMClient.unregisterApplicationMaster(appStatus, appMessage, null);
} catch (YarnException ex) {
LOG.error("Failed to unregister application", ex);
} catch (IOException e) {
LOG.error("Failed to unregister application", e);
}
amRMClient.stop();
```
FAQ
---
### How can I distribute my application's jars to all of the nodes in the YARN cluster that need it?
You can use the LocalResource to add resources to your application request. This will cause YARN to distribute the resource to the ApplicationMaster node. If the resource is a tgz, zip, or jar, you can have YARN unzip it. Then, all you need to do is add the unzipped folder to your classpath. For example, when creating your application request:
```java
File packageFile = new File(packagePath);
Url packageUrl = ConverterUtils.getYarnUrlFromPath(
FileContext.getFileContext().makeQualified(new Path(packagePath)));
packageResource.setResource(packageUrl);
packageResource.setSize(packageFile.length());
packageResource.setTimestamp(packageFile.lastModified());
packageResource.setType(LocalResourceType.ARCHIVE);
packageResource.setVisibility(LocalResourceVisibility.APPLICATION);
resource.setMemory(memory);
containerCtx.setResource(resource);
containerCtx.setCommands(ImmutableList.of(
"java -cp './package/*' some.class.to.Run "
+ "1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout "
+ "2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));
containerCtx.setLocalResources(
Collections.singletonMap("package", packageResource));
appCtx.setApplicationId(appId);
appCtx.setUser(user.getShortUserName());
appCtx.setAMContainerSpec(containerCtx);
yarnClient.submitApplication(appCtx);
```
As you can see, the `setLocalResources` command takes a map of names to resources. The name becomes a symlink in your application's cwd, so you can just refer to the artifacts inside by using ./package/\*.
**Note**: Java's classpath (cp) argument is VERY sensitive. Make sure you get the syntax EXACTLY correct.
Once your package is distributed to your AM, you'll need to follow the same process whenever your AM starts a new container (assuming you want the resources to be sent to your container). The code for this is the same. You just need to make sure that you give your AM the package path (either HDFS, or local), so that it can send the resource URL along with the container ctx.
### How do I get the ApplicationMaster's `ApplicationAttemptId`?
The `ApplicationAttemptId` will be passed to the AM via the environment and the value from the environment can be converted into an `ApplicationAttemptId` object via the ConverterUtils helper function.
### Why is my container killed by the NodeManager?
This is likely due to high memory usage exceeding your requested container memory size. There are a number of reasons that can cause this. First, look at the process tree that the NodeManager dumps when it kills your container. The two things you're interested in are physical memory and virtual memory. If you have exceeded physical memory limits, your app is using too much physical memory. If you're running a Java app, you can use -hprof to look at what is taking up space in the heap. If you have exceeded virtual memory, you may need to increase the value of the cluster-wide configuration variable `yarn.nodemanager.vmem-pmem-ratio`.
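
The relationship between the two limits can be sketched as follows: the virtual memory limit is taken to be the container's physical memory size multiplied by `yarn.nodemanager.vmem-pmem-ratio`. The class `MemoryLimitSketch` is hypothetical, the default ratio of 2.1 is an assumption to verify against your cluster's configuration, and the NodeManager's actual accounting may differ in detail.

```java
class MemoryLimitSketch {
  // Assumed default of yarn.nodemanager.vmem-pmem-ratio; verify on your cluster.
  static final double DEFAULT_VMEM_PMEM_RATIO = 2.1;

  /** Virtual memory limit in bytes for a container of the given size in MB. */
  static long virtualLimitBytes(long containerMemoryMb, double ratio) {
    return (long) (containerMemoryMb * 1024L * 1024L * ratio);
  }

  /** Reports which limit, if any, the observed usage (in bytes) exceeds. */
  static String diagnose(long pmemUsedBytes, long vmemUsedBytes,
                         long containerMemoryMb, double ratio) {
    long pmemLimit = containerMemoryMb * 1024L * 1024L;
    if (pmemUsedBytes > pmemLimit) {
      return "physical";
    }
    if (vmemUsedBytes > virtualLimitBytes(containerMemoryMb, ratio)) {
      return "virtual";
    }
    return "within limits";
  }
}
```

Comparing the figures from the dumped process tree against these two limits indicates whether to reduce the application's footprint or to revisit the ratio.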
### How do I include native libraries?
Setting `-Djava.library.path` on the command line while launching a container can cause native libraries used by Hadoop to not be loaded correctly and can result in errors. It is cleaner to use `LD_LIBRARY_PATH` instead.
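
One way to follow this advice is to add `LD_LIBRARY_PATH` to the environment map that is passed to the container launch context. The helper below is a minimal sketch: `NativeLibEnvSketch` is a hypothetical class, the directory passed in is whatever location your native libraries are localized to, and the `:` separator assumes a Linux host.

```java
import java.util.HashMap;
import java.util.Map;

class NativeLibEnvSketch {
  /** Returns a copy of env with nativeDir prepended to LD_LIBRARY_PATH. */
  static Map<String, String> withNativeDir(Map<String, String> env, String nativeDir) {
    Map<String, String> result = new HashMap<>(env);
    String existing = result.get("LD_LIBRARY_PATH");
    // Keep any existing entries after the new directory (Linux ':' separator).
    result.put("LD_LIBRARY_PATH",
        existing == null || existing.isEmpty() ? nativeDir : nativeDir + ":" + existing);
    return result;
  }
}
```

The returned map would take the place of the environment argument when constructing the `ContainerLaunchContext`.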
Useful Links
------------
* [YARN Architecture](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html)
* [YARN Capacity Scheduler](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html)
* [YARN Fair Scheduler](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html)
Sample Code
-----------
Yarn distributed shell: in `hadoop-yarn-applications-distributedshell` project after you set up your development environment.

@ -0,0 +1,42 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Apache Hadoop NextGen MapReduce (YARN)
==================
MapReduce has undergone a complete overhaul in hadoop-0.23 and we now have, what we call, MapReduce 2.0 (MRv2) or YARN.
The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (*RM*) and per-application ApplicationMaster (*AM*). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.
The ResourceManager and per-node slave, the NodeManager (*NM*), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system.
The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
![MapReduce NextGen Architecture](./yarn_architecture.gif)
The ResourceManager has two main components: Scheduler and ApplicationsManager.
The Scheduler is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application. Also, it offers no guarantees about restarting failed tasks either due to application failure or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a resource *Container* which incorporates elements such as memory, cpu, disk, network etc. In the first version, only `memory` is supported.
The Scheduler has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications etc. The current Map-Reduce schedulers such as the CapacityScheduler and the FairScheduler would be some examples of the plug-in.
The CapacityScheduler supports `hierarchical queues` to allow for more predictable sharing of cluster resources.
The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.
The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.
MRv2 maintains **API compatibility** with the previous stable release (hadoop-1.x). This means that all Map-Reduce jobs should still run unchanged on top of MRv2 with just a recompile.

@ -0,0 +1,272 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
YARN Commands
=============
* [Overview](#Overview)
* [User Commands](#User_Commands)
* [application](#application)
* [applicationattempt](#applicationattempt)
* [classpath](#classpath)
* [container](#container)
* [jar](#jar)
* [logs](#logs)
* [node](#node)
* [queue](#queue)
* [version](#version)
* [Administration Commands](#Administration_Commands)
* [daemonlog](#daemonlog)
* [nodemanager](#nodemanager)
* [proxyserver](#proxyserver)
* [resourcemanager](#resourcemanager)
* [rmadmin](#rmadmin)
* [scmadmin](#scmadmin)
* [sharedcachemanager](#sharedcachemanager)
* [timelineserver](#timelineserver)
* [Files](#Files)
* [etc/hadoop/hadoop-env.sh](#etchadoophadoop-env.sh)
* [etc/hadoop/yarn-env.sh](#etchadoopyarn-env.sh)
* [etc/hadoop/hadoop-user-functions.sh](#etchadoophadoop-user-functions.sh)
* [~/.hadooprc](#a.hadooprc)
Overview
--------
YARN commands are invoked by the bin/yarn script. Running the yarn script without any arguments prints the description for all commands.
Usage: `yarn [SHELL_OPTIONS] COMMAND [GENERIC_OPTIONS] [COMMAND_OPTIONS]`
YARN has an option parsing framework that handles parsing generic options as well as running classes.
| COMMAND\_OPTIONS | Description |
|:---- |:---- |
| SHELL\_OPTIONS | The common set of shell options. These are documented on the [Commands Manual](../../hadoop-project-dist/hadoop-common/CommandsManual.html#Shell_Options) page. |
| GENERIC\_OPTIONS | The common set of options supported by multiple commands. See the Hadoop [Commands Manual](../../hadoop-project-dist/hadoop-common/CommandsManual.html#Generic_Options) for more information. |
| COMMAND COMMAND\_OPTIONS | Various commands with their options are described in the following sections. The commands have been grouped into [User Commands](#User_Commands) and [Administration Commands](#Administration_Commands). |
User Commands
-------------
Commands useful for users of a Hadoop cluster.
### `application`
Usage: `yarn application [options] `
| COMMAND\_OPTIONS | Description |
|:---- |:---- |
| -appStates States | Works with -list to filter applications based on input comma-separated list of application states. The valid application state can be one of the following: ALL, NEW, NEW\_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED |
| -appTypes Types | Works with -list to filter applications based on input comma-separated list of application types. |
| -list | Lists applications from the RM. Supports optional use of -appTypes to filter applications based on application type, and -appStates to filter applications based on application state. |
| -kill ApplicationId | Kills the application. |
| -status ApplicationId | Prints the status of the application. |
Prints application(s) report/kill application
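
When the command is driven from a program rather than typed by hand, the `-list` filters from the table combine as shown in this sketch. `YarnApplicationArgsSketch` is a hypothetical helper that only builds the argument vector (for example, for `ProcessBuilder`); it does not run anything and assumes `yarn` is on the `PATH`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class YarnApplicationArgsSketch {
  /** Builds the argument vector for "yarn application -list" with optional filters. */
  static List<String> listArgs(List<String> states, List<String> types) {
    List<String> args = new ArrayList<>(Arrays.asList("yarn", "application", "-list"));
    if (!states.isEmpty()) {
      // -appStates takes a single comma-separated value.
      args.add("-appStates");
      args.add(String.join(",", states));
    }
    if (!types.isEmpty()) {
      // -appTypes takes a single comma-separated value as well.
      args.add("-appTypes");
      args.add(String.join(",", types));
    }
    return args;
  }
}
```

Passing the states `RUNNING` and `ACCEPTED` with no types yields the equivalent of `yarn application -list -appStates RUNNING,ACCEPTED`.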
### `applicationattempt`
Usage: `yarn applicationattempt [options] `
| COMMAND\_OPTIONS | Description |
|:---- |:---- |
| -help | Help |
| -list ApplicationId | Lists application attempts from the RM |
| -status Application Attempt Id | Prints the status of the application attempt. |
Prints applicationattempt(s) report
### `classpath`
Usage: `yarn classpath`
Prints the class path needed to get the Hadoop jar and the required libraries
### `container`
Usage: `yarn container [options] `
| COMMAND\_OPTIONS | Description |
|:---- |:---- |
| -help | Help |
| -list ApplicationAttemptId | Lists containers for the application attempt. |
| -status ContainerId | Prints the status of the container. |
Prints container(s) report
### `jar`
Usage: `yarn jar <jar> [mainClass] args... `
Runs a jar file. Users can bundle their YARN code in a jar file and execute it using this command.
### `logs`
Usage: `yarn logs -applicationId <application ID> [options] `
| COMMAND\_OPTIONS | Description |
|:---- |:---- |
| -applicationId \<application ID\> | Specifies an application id |
| -appOwner AppOwner | AppOwner (assumed to be current user if not specified) |
| -containerId ContainerId | ContainerId (must be specified if node address is specified) |
| -help | Help |
| -nodeAddress NodeAddress | NodeAddress in the format nodename:port (must be specified if container id is specified) |
Dump the container logs
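The pairing rule in the table above (container id and node address must be given together) can be sketched as a small wrapper. The function name `app_logs` and its argument order are hypothetical; the flags are the ones documented above.

```shell
# app_logs APP_ID [CONTAINER_ID NODE_ADDRESS]
# Hypothetical helper: dumps logs for a whole application, or for a single container.
# Per the table above, -containerId and -nodeAddress must be specified together.
app_logs() {
  local app_id="$1"
  local container_id="${2:-}"
  local node_address="${3:-}"
  if [ -z "$container_id" ] && [ -z "$node_address" ]; then
    yarn logs -applicationId "$app_id"
  elif [ -n "$container_id" ] && [ -n "$node_address" ]; then
    yarn logs -applicationId "$app_id" -containerId "$container_id" -nodeAddress "$node_address"
  else
    echo "app_logs: container id and node address must be specified together" >&2
    return 2
  fi
}
```

Calling `app_logs <application ID>` dumps all container logs for the application, while passing a container id without a node address is rejected before `yarn` is invoked.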
### `node`
Usage: `yarn node [options] `
| COMMAND\_OPTIONS | Description |
|:---- |:---- |
| -all | Works with -list to list all nodes. |
| -list | Lists all running nodes. Supports optional use of -states to filter nodes based on node state, and -all to list all nodes. |
| -states States | Works with -list to filter nodes based on input comma-separated list of node states. |
| -status NodeId | Prints the status report of the node. |
Prints node report(s)
### `queue`
Usage: `yarn queue [options] `
| COMMAND\_OPTIONS | Description |
|:---- |:---- |
| -help | Help |
| -status QueueName | Prints the status of the queue. |
Prints queue information
### `version`
Usage: `yarn version`
Prints the Hadoop version.
Administration Commands
-----------------------
Commands useful for administrators of a Hadoop cluster.
### `daemonlog`
Usage:
```
yarn daemonlog -getlevel <host:httpport> <classname>
yarn daemonlog -setlevel <host:httpport> <classname> <level>
```
| COMMAND\_OPTIONS | Description |
|:---- |:---- |
| -getlevel `<host:httpport>` `<classname>` | Prints the log level of the log identified by a qualified `<classname>`, in the daemon running at `<host:httpport>`. This command internally connects to `http://<host:httpport>/logLevel?log=<classname>` |
| -setlevel `<host:httpport> <classname> <level>` | Sets the log level of the log identified by a qualified `<classname>` in the daemon running at `<host:httpport>`. This command internally connects to `http://<host:httpport>/logLevel?log=<classname>&level=<level>` |
Get/Set the log level for a Log identified by a qualified class name in the daemon.
Example: `$ bin/yarn daemonlog -setlevel 127.0.0.1:8088 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl DEBUG`
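The table above gives the URL that each form of the command connects to. As an illustrative sketch only, the function below assembles that URL from the same three arguments; `daemonlog_url` is a hypothetical name and the function does not contact the daemon.

```shell
# daemonlog_url HOST_PORT CLASSNAME [LEVEL]
# Hypothetical helper: prints the URL that `yarn daemonlog` is described as contacting.
# With LEVEL it is the -setlevel form; without LEVEL it is the -getlevel form.
daemonlog_url() {
  local host_port="$1"
  local classname="$2"
  local level="${3:-}"
  local url="http://${host_port}/logLevel?log=${classname}"
  if [ -n "$level" ]; then
    url="${url}&level=${level}"
  fi
  printf '%s\n' "$url"
}
```

With the arguments from the example above, `daemonlog_url 127.0.0.1:8088 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl DEBUG` prints the `-setlevel` URL for that class.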
### `nodemanager`
Usage: `yarn nodemanager`
Start the NodeManager
### `proxyserver`
Usage: `yarn proxyserver`
Start the web proxy server
### `resourcemanager`
Usage: `yarn resourcemanager [-format-state-store]`
| COMMAND\_OPTIONS | Description |
|:---- |:---- |
| -format-state-store | Formats the RMStateStore. This will clear the RMStateStore and is useful if past applications are no longer needed. This should be run only when the ResourceManager is not running. |
Start the ResourceManager
### `rmadmin`
Usage:
```
yarn rmadmin [-refreshQueues]
[-refreshNodes]
[-refreshUserToGroupsMappings]
[-refreshSuperUserGroupsConfiguration]
[-refreshAdminAcls]
[-refreshServiceAcl]
[-getGroups [username]]
[-transitionToActive [--forceactive] [--forcemanual] <serviceId>]
[-transitionToStandby [--forcemanual] <serviceId>]
[-failover [--forcefence] [--forceactive] <serviceId1> <serviceId2>]
[-getServiceState <serviceId>]
[-checkHealth <serviceId>]
[-help [cmd]]
```
| COMMAND\_OPTIONS | Description |
|:---- |:---- |
| -refreshQueues | Reload the queues' acls, states and scheduler specific properties. ResourceManager will reload the mapred-queues configuration file. |
| -refreshNodes | Refresh the hosts information at the ResourceManager. |
| -refreshUserToGroupsMappings | Refresh user-to-groups mappings. |
| -refreshSuperUserGroupsConfiguration | Refresh superuser proxy groups mappings. |
| -refreshAdminAcls | Refresh acls for administration of ResourceManager |
| -refreshServiceAcl | Reload the service-level authorization policy file. ResourceManager will reload the authorization policy file. |
| -getGroups [username] | Get groups the specified user belongs to. |
| -transitionToActive [--forceactive] [--forcemanual] \<serviceId\> | Transitions the service into Active state. If the --forceactive option is used, the target is made active without checking that there is no other active node. This command cannot be used if automatic failover is enabled. The --forcemanual option overrides this restriction and should be used with caution. |
| -transitionToStandby [--forcemanual] \<serviceId\> | Transitions the service into Standby state. This command cannot be used if automatic failover is enabled. The --forcemanual option overrides this restriction and should be used with caution. |
| -failover [--forcefence] [--forceactive] \<serviceId1\> \<serviceId2\> | Initiates a failover from serviceId1 to serviceId2. If the --forceactive option is used, the failover to the target service is attempted even if it is not ready. This command cannot be used if automatic failover is enabled. |
| -getServiceState \<serviceId\> | Returns the state of the service. |
| -checkHealth \<serviceId\> | Requests that the service perform a health check. The RMAdmin tool will exit with a non-zero exit code if the check fails. |
| -help [cmd] | Displays help for the given command or all commands if none is specified. |
Runs ResourceManager admin client
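Because `-checkHealth` is documented above as exiting with a non-zero code when the check fails, it can be used directly in a shell conditional. The sketch below assumes a hypothetical function name `rm_is_healthy`; the service id is whatever the cluster's HA configuration defines.

```shell
# rm_is_healthy SERVICE_ID
# Hypothetical helper: succeeds only if the health check passes, relying on the
# documented non-zero exit code of `yarn rmadmin -checkHealth` on failure.
rm_is_healthy() {
  local service_id="$1"
  if yarn rmadmin -checkHealth "$service_id"; then
    echo "ResourceManager ${service_id} is healthy"
  else
    echo "ResourceManager ${service_id} failed its health check" >&2
    return 1
  fi
}
```

A monitoring script could call `rm_is_healthy rm1 || exit 1`, where `rm1` is a placeholder service id rather than a value taken from this document.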
### `scmadmin`
Usage: `yarn scmadmin [options] `
| COMMAND\_OPTIONS | Description |
|:---- |:---- |
| -help | Help |
| -runCleanerTask | Runs the cleaner task |
Runs Shared Cache Manager admin client
### `sharedcachemanager`
Usage: `yarn sharedcachemanager`
Start the Shared Cache Manager
### `timelineserver`
Usage: `yarn timelineserver`
Start the TimeLineServer
Files
-----
| File | Description |
|:---- |:---- |
| etc/hadoop/hadoop-env.sh | This file stores the global settings used by all Hadoop shell commands. |
| etc/hadoop/yarn-env.sh | This file stores overrides used by all YARN shell commands. |
| etc/hadoop/hadoop-user-functions.sh | This file allows for advanced users to override some shell functionality. |
| ~/.hadooprc | This stores the personal environment for an individual user. It is processed after the `hadoop-env.sh`, `hadoop-user-functions.sh`, and `yarn-env.sh` files and can contain the same settings. |

View File

@ -0,0 +1,75 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
MapReduce NextGen aka YARN aka MRv2
===================================
The new architecture introduced in hadoop-0.23 divides the two major functions of the JobTracker, resource management and job life-cycle management, into separate components.
The new ResourceManager manages the global assignment of compute resources to applications, and the per-application ApplicationMaster manages the application's scheduling and coordination.
An application is either a single job in the sense of classic MapReduce jobs or a DAG of such jobs.
The ResourceManager and per-machine NodeManager daemon, which manages the user processes on that machine, form the computation fabric.
The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
More details are available in the [Architecture](./YARN.html) document.
Documentation Index
===================
YARN
----
* [YARN Architecture](./YARN.html)
* [Capacity Scheduler](./CapacityScheduler.html)
* [Fair Scheduler](./FairScheduler.html)
* [ResourceManager Restart](./ResourceManagerRestart.html)
* [ResourceManager HA](./ResourceManagerHA.html)
* [Web Application Proxy](./WebApplicationProxy.html)
* [YARN Timeline Server](./TimelineServer.html)
* [Writing YARN Applications](./WritingYarnApplications.html)
* [YARN Commands](./YarnCommands.html)
* [Scheduler Load Simulator](../../hadoop-sls/SchedulerLoadSimulator.html)
* [NodeManager Restart](./NodeManagerRestart.html)
* [DockerContainerExecutor](./DockerContainerExecutor.html)
* [Using CGroups](./NodeManagerCGroups.html)
* [Secure Containers](./SecureContainer.html)
* [Registry](./registry/index.html)
YARN REST APIs
--------------
* [Introduction](./WebServicesIntro.html)
* [Resource Manager](./ResourceManagerRest.html)
* [Node Manager](./NodeManagerRest.html)