<p>This feature is UNSTABLE. As this feature continues to evolve, APIs may not be maintained and functionality may be changed or removed.</p>
<p>Enabling this feature and running runC containers in your cluster has security implications. Given runC’s integration with many powerful kernel features, it is imperative that administrators understand runC security before enabling this feature.</p></section><section>
<h2><a name="Overview"></a>Overview</h2>
<p><a class="externalLink" href="https://github.com/opencontainers/runc">runC</a> is a CLI tool for spawning and running containers according to the Open Container Initiative (OCI) specification. runC was originally <a class="externalLink" href="https://www.docker.com/blog/runc/">spun out</a> of the Docker infrastructure. Together with a rootfs mountpoint that is created via squashFS images, runC enables users to bundle an application together with its preferred execution environment to be executed on a target machine. For more information about the OCI, see their <a class="externalLink" href="https://www.opencontainers.org/">website</a>.</p>
<p>The Linux Container Executor (LCE) allows the YARN NodeManager to launch YARN containers that run directly on the host machine, inside Docker containers, or inside runC containers. The application requesting the resources can specify for each container how it should be executed. The LCE also provides enhanced security and is required when deploying a secure cluster. When the LCE launches a YARN container to execute in a runC container, the application can specify the runC image to be used. These runC images can be built from Docker images.</p>
<p>runC containers provide a custom execution environment in which the application’s code runs, isolated from the execution environment of the NodeManager and other applications. These containers can include special libraries needed by the application, and they can have different versions of native tools and libraries including Perl, Python, and Java. runC containers can even run a different flavor of Linux than what is running on the NodeManager.</p>
<p>runC for YARN provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine).</p>
<p>runC support in the LCE is still evolving. To track progress and take a look at the runC design document, check out <a class="externalLink" href="https://issues.apache.org/jira/browse/YARN-9014">YARN-9014</a>, the umbrella JIRA for runC support improvements.</p></section><section>
<p>The LCE requires that the container-executor binary be owned by root:hadoop and have 6050 permissions. In order to launch runC containers, runC must be installed on all NodeManager hosts where runC containers will be launched.</p>
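<p>The ownership and permission requirement above can be checked before starting the NodeManager. The following is a minimal sketch, not part of Hadoop: the function names are hypothetical, and the numeric GID of the <code>hadoop</code> group is assumed to be looked up by the caller.</p>

```python
import os
import stat

# 6050 = setuid (4000) + setgid (2000) + group read/execute (050).
REQUIRED_MODE = 0o6050


def mode_is_valid(st_mode: int) -> bool:
    """Return True if the permission bits of st_mode are exactly 6050."""
    return stat.S_IMODE(st_mode) == REQUIRED_MODE


def check_container_executor(path: str, expected_gid: int) -> list:
    """Return a list of problems found for the binary at path.

    An empty list means the owner is root (uid 0), the group matches
    expected_gid, and the mode is 6050.
    """
    problems = []
    st = os.stat(path)
    if st.st_uid != 0:
        problems.append(f"owner uid is {st.st_uid}, expected 0 (root)")
    if st.st_gid != expected_gid:
        problems.append(f"group gid is {st.st_gid}, expected {expected_gid}")
    if not mode_is_valid(st.st_mode):
        problems.append(f"mode is {stat.S_IMODE(st.st_mode):o}, expected 6050")
    return problems
```

<p>A non-empty result points at the same ownership or permission problems that typically prevent a NodeManager from starting once the LCE is enabled.</p>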
<p>The following properties should be set in yarn-site.xml:</p>
<p>In addition, a container-executor.cfg file must exist and contain settings for the container executor. The file must be owned by root with permissions 0400. The format of the file is the standard Java properties file format, for example:</p>
<div class="source">
<div class="source">
<pre>key=value
</pre></div></div>
<p>The following properties are required to enable runC support:</p>
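<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th align="left">Configuration Name </th>
<th align="left"> Description </th></tr>
</thead><tbody>
<tr class="b">
<td align="left"><code>yarn.nodemanager.linux-container-executor.group</code></td>
<td align="left"> The Unix group of the NodeManager. It should match the yarn.nodemanager.linux-container-executor.group in the yarn-site.xml file. </td></tr>
</tbody>
</table>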
<p>The container-executor.cfg must contain a section to determine the capabilities that containers are allowed. It contains the following properties:</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th align="left">Configuration Name </th>
<th align="left"> Description </th></tr>
</thead><tbody>
<tr class="b">
<td align="left"><code>module.enabled</code></td>
<td align="left"> Must be “true” or “false” to enable or disable launching runC containers, respectively. Launching runC containers is disabled by default. </td></tr>
<tr class="a">
<td align="left"><code>runc.binary</code></td>
<td align="left"> The binary used to launch runC containers. /usr/bin/runc by default. </td></tr>
<tr class="b">
<td align="left"><code>runc.run-root</code></td>
<td align="left"> The directory where all runtime mounts and overlay mounts will be placed. </td></tr>
<tr class="a">
<td align="left"><code>runc.allowed.ro-mounts</code></td>
<td align="left"> Comma-separated directories that containers are allowed to mount in read-only mode. By default, no directories are allowed to be mounted. </td></tr>
<tr class="b">
<td align="left"><code>runc.allowed.rw-mounts</code></td>
<td align="left"> Comma-separated directories that containers are allowed to mount in read-write mode. By default, no directories are allowed to be mounted. </td></tr>
</tbody>
</table>
<p>Please note that if you wish to run runC containers that require access to the YARN local directories, you must add them to the runc.allowed.rw-mounts list.</p>
<p>In addition, containers are not permitted to mount any parent of the container-executor.cfg directory in read-write mode.</p>
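<p>To illustrate how the runC properties above fit together, the following sketch parses a sample configuration fragment. It is an illustration only, not how container-executor reads the file. It assumes the runC settings live in a section named <code>[runc]</code> (the section name is not stated above), and the paths in the sample, as well as the function names, are hypothetical.</p>

```python
import configparser

# Hypothetical sample; the [runc] section name and all paths are assumptions.
SAMPLE_CFG = """
[runc]
module.enabled=true
runc.binary=/usr/bin/runc
runc.run-root=/run/yarn-container-executor
runc.allowed.ro-mounts=/sys/fs/cgroup,/usr/local/hadoop
runc.allowed.rw-mounts=/data/yarn/local
"""


def split_csv(value: str) -> list:
    """Split a comma-separated value into a list, dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def load_runc_section(text: str) -> dict:
    """Parse the assumed [runc] section and apply the documented defaults."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    section = parser["runc"]
    return {
        # Anything other than "true" leaves runC launching disabled.
        "enabled": section.get("module.enabled", fallback="false").strip().lower() == "true",
        "binary": section.get("runc.binary", fallback="/usr/bin/runc"),
        "run_root": section.get("runc.run-root", fallback=None),
        # By default, no directories are allowed to be mounted.
        "ro_mounts": split_csv(section.get("runc.allowed.ro-mounts", fallback="")),
        "rw_mounts": split_csv(section.get("runc.allowed.rw-mounts", fallback="")),
    }
```

<p>With the sample above, the YARN local directory <code>/data/yarn/local</code> appears in the read-write list, which is what containers needing access to the YARN local directories require.</p>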
<p>The following properties are optional:</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th align="left">Configuration Name </th>
<th align="left"> Description </th></tr>
</thead><tbody>
<tr class="b">
<td align="left"><code>min.user.id</code></td>
<td align="left"> The minimum UID that is allowed to launch applications. The default is no minimum. </td></tr>
<tr class="a">
<td align="left"><code>banned.users</code></td>
<td align="left"> A comma-separated list of usernames who should not be allowed to launch applications. The default setting is: yarn, mapred, hdfs, and bin. </td></tr>
<tr class="b">
<td align="left"><code>allowed.system.users</code></td>
<td align="left"> A comma-separated list of usernames who should be allowed to launch applications even if their UIDs are below the configured minimum. If a user appears in allowed.system.users and banned.users, the user will be considered banned. </td></tr>
<tr class="a">
<td align="left"><code>feature.tc.enabled</code></td>
<td align="left"> Must be “true” or “false”. “false” means traffic control commands are disabled. “true” means traffic control commands are allowed. </td></tr>
</tbody>
</table>
<p>runC containers are run inside of images that are derived from Docker images. The Docker images are transformed into a set of squashFS file images and uploaded into HDFS. In order to work with YARN, there are a few requirements for these Docker images.</p>
<ol style="list-style-type: decimal">
<li>
<p>The runC container will be explicitly launched with the application owner as the container user. If the application owner is not a valid user in the Docker image, the application will fail. The container user is specified by the user’s UID. If the user’s UID is different between the NodeManager host and the Docker image, the container may be launched as the wrong user or may fail to launch because the UID does not exist. See the <a href="#user-management">User Management in runC Container</a> section for more details.</p>
</li>
<li>
<p>The Docker image must have whatever is expected by the application in order to execute. In the case of Hadoop (MapReduce or Spark), the Docker image must contain the JRE and Hadoop libraries and have the necessary environment variables set: JAVA_HOME, HADOOP_COMMON_PATH, HADOOP_HDFS_HOME, HADOOP_MAPRED_HOME, HADOOP_YARN_HOME, and HADOOP_CONF_DIR. Note that the Java and Hadoop component versions available in the Docker image must be compatible with what’s installed on the cluster and in any other Docker images being used for other tasks of the same job. Otherwise the Hadoop components started in the runC container may be unable to communicate with external Hadoop components.</p>
</li>
<li>
<p><code>/bin/bash</code> must be available inside the image. This is generally true; however, tiny Docker images (e.g. ones that use busybox for shell commands) might not have bash installed. In this case, the following error is displayed:</p>
<div class="source">
<div class="source">
<pre>Shell error output: /usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:235: starting container process caused "exec: \"bash\": executable file not found in $PATH".
Shell output: main : command provided 4
</pre></div></div>
</li>
<li>
<p>The <code>find</code> command must also be available inside the image. Not having <code>find</code> causes this error:</p>
<div class="source">
<div class="source">
<pre>Container exited with a non-zero exit code 127. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/tmp/hadoop-systest/nm-local-dir/usercache/hadoopuser/appcache/application_1561638268473_0017/container_1561638268473_0017_01_000002/launch_container.sh: line 44: find: command not found
</pre></div></div>
</li>
</ol>
<p>If a Docker image has an entry point set, the entry point will be executed with the launch command of the container as its arguments.</p>
<p>The runC images that are derived from Docker images are localized onto the hosts where the runC containers will execute, just like any other localized resource. Both MapReduce and Spark assume that tasks which take more than 10 minutes to report progress have stalled, so specifying a large image may cause the application to fail if the localization takes too long.</p></section><section>
<h2><a name="Transforming_a_Docker_Image_into_a_runC_Image"></a><a href="#docker-to-squash"></a>Transforming a Docker Image into a runC Image</h2>
<p>Every Docker image consists of three things:</p>
<ul>
<li>A set of layers that create the file system.</li>
<li>A config file that holds information about the environment of the image.</li>
<li>A manifest that describes which layers and config are needed for that image.</li>
</ul>
<p>Together, these three pieces combine to create an Open Container Initiative (OCI) compliant image. runC runs on top of OCI-compliant containers, but with a small twist: each layer that the runC runtime uses is compressed into a squashFS file system. The squashFS layers, along with the config and manifest, are uploaded to HDFS together with an <code>image-tag-to-hash mapping</code> file that describes the mapping between image tags and the manifest associated with each image. Setting all of this up is a complicated and tedious process. There is a patch on <a class="externalLink" href="https://issues.apache.org/jira/browse/YARN-9564">YARN-9564</a> that contains an unofficial Python script named <code>docker-to-squash.py</code> to help with the conversion. This tool takes a Docker image as input, converts all of its layers into squashFS file systems, and uploads the squashFS layers, config, and manifest to HDFS underneath the runc-root. It also creates or updates the <code>image-tag-to-hash</code> mapping file. For example, the script can upload an image named <code>centos:latest</code> to HDFS with the runC image name <code>centos</code>.</p>
<p>Before attempting to launch a runC container, make sure that the LCE configuration is working for applications requesting regular YARN containers. If after enabling the LCE one or more NodeManagers fail to start, the cause is most likely that the ownership and/or permissions on the container-executor binary are incorrect. Check the logs to confirm.</p>
<p>In order to run an application in a runC container, set the following environment variables in the application’s environment:</p>
<td align="left"> Determines whether an application will be launched in a runC container. If the value is “runc”, the application will be launched in a runC container. Otherwise a regular process tree container will be used. </td></tr>
<td align="left"> Adds additional volume mounts to the runC container. The value of the environment variable should be a comma-separated list of mounts. All such mounts must be given as “source:dest:mode” and the mode must be “ro” (read-only) or “rw” (read-write) to specify the type of access being requested. If neither is specified, read-write will be assumed. The requested mounts will be validated by container-executor based on the values set in container-executor.cfg for runc.allowed.ro-mounts and runc.allowed.rw-mounts. </td></tr>
</tbody>
</table>
<p>The first two are required. The remainder can be set as needed. While controlling the container type through environment variables is somewhat less than ideal, it allows applications with no awareness of YARN’s runC support (such as MapReduce and Spark) to nonetheless take advantage of it through their support for configuring the application environment.</p>
<p><b>Note</b> The runtime will not work if you mount anything onto /tmp or /var/tmp in the container.</p>
<p>Once an application has been submitted to be launched in a runC container, the application will behave exactly as any other YARN application. Logs will be aggregated and stored in the relevant history server. The application life cycle will be the same as for a non-runC application.</p></section><section>
<p><b>WARNING</b> Care should be taken when enabling this feature. Enabling access to directories such as, but not limited to, /, /etc, /run, or /home is not advisable and can result in containers negatively impacting the host or leaking sensitive information. <b>WARNING</b></p>
<p>Files and directories from the host are commonly needed within the runC containers, which runC provides through mounts into the container. Examples include localized resources, Apache Hadoop binaries, and sockets.</p>
<p>In order to mount anything into the container, the following must be configured.</p>
<ul>
<li>The administrator must define the volume whitelist in container-executor.cfg by setting <code>runc.allowed.ro-mounts</code> and <code>runc.allowed.rw-mounts</code> to the list of parent directories that are allowed to be mounted.</li>
</ul>
<p>The administrator-supplied whitelist is defined as a comma-separated list of directories that are allowed to be mounted into containers. The source directory supplied by the user must either match or be a child of one of the listed directories.</p>
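<p>The matching rule can be sketched as follows. This is an illustration of the "match or be a child of" rule only, not the container-executor implementation; the function name is hypothetical, and the sketch normalizes paths lexically without resolving symbolic links, which a real check would need to consider.</p>

```python
import posixpath


def is_allowed_source(source: str, whitelist: list) -> bool:
    """Return True if source equals, or lies below, a whitelisted directory."""
    src = posixpath.normpath(source)
    for allowed in whitelist:
        root = posixpath.normpath(allowed)
        if src == root:
            return True
        # Compare on a path-component boundary so that /sys/fs/cgroup2
        # is not treated as a child of /sys/fs/cgroup.
        prefix = root if root.endswith("/") else root + "/"
        if src.startswith(prefix):
            return True
    return False
```

<p>Normalizing before comparing means that a source such as <code>/sys/fs/cgroup/../../../etc</code> is evaluated as <code>/etc</code> and rejected when only <code>/sys/fs/cgroup</code> is whitelisted.</p>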
<p>The user-supplied mount list is defined as a comma-separated list of entries in the form <i>source</i>:<i>destination</i> or <i>source</i>:<i>destination</i>:<i>mode</i>. The source is the file or directory on the host. The destination is the path within the container where the source will be bind mounted. The mode defines the access the user expects for the mount, which can be ro (read-only) or rw (read-write). If not specified, rw is assumed. The mode may also include a bind propagation option (shared, rshared, slave, rslave, private, or rprivate). In that case, the mode should be of the form <i>option</i>, rw+<i>option</i>, or ro+<i>option</i>.</p>
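<p>The mount syntax described above can be made concrete with a small parser. This is a sketch under the stated grammar, not the NodeManager's own parsing code; the function names and the returned tuple layout are hypothetical.</p>

```python
ACCESS_MODES = {"ro", "rw"}
PROPAGATION_OPTIONS = {"shared", "rshared", "slave", "rslave", "private", "rprivate"}


def parse_mount(spec: str) -> tuple:
    """Parse source:destination[:mode] into (source, dest, access, propagation).

    Access defaults to "rw"; propagation is None when not given.
    """
    parts = spec.split(":")
    if len(parts) == 2:
        source, dest = parts
        mode = "rw"
    elif len(parts) == 3:
        source, dest, mode = parts
    else:
        raise ValueError(f"expected source:destination[:mode], got {spec!r}")
    if not source or not dest:
        raise ValueError(f"source and destination must be non-empty: {spec!r}")

    tokens = mode.split("+")
    access, propagation = "rw", None
    if len(tokens) == 1 and tokens[0] in ACCESS_MODES:
        access = tokens[0]                      # "ro" or "rw"
    elif len(tokens) == 1 and tokens[0] in PROPAGATION_OPTIONS:
        propagation = tokens[0]                 # option only, rw assumed
    elif (len(tokens) == 2 and tokens[0] in ACCESS_MODES
          and tokens[1] in PROPAGATION_OPTIONS):
        access, propagation = tokens            # rw+option or ro+option
    else:
        raise ValueError(f"invalid mode {mode!r} in {spec!r}")
    return source, dest, access, propagation


def parse_mount_list(value: str) -> list:
    """Parse a comma-separated list of mount specifications."""
    return [parse_mount(item.strip()) for item in value.split(",") if item.strip()]
```

<p>For instance, <code>/sys/fs/cgroup:/sys/fs/cgroup:ro</code> parses to a read-only mount with no propagation option, while <code>/a:/b</code> falls back to read-write.</p>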
<p>The following example outlines how to use this feature to mount the commonly needed /sys/fs/cgroup directory into the container running on YARN.</p>
<p>The administrator sets runc.allowed.ro-mounts in container-executor.cfg to “/sys/fs/cgroup”. Applications can now request that “/sys/fs/cgroup” be mounted from the host into the container in read-only mode.</p>
<p>The NodeManager has the option to set up a default list of read-only or read-write mounts to be added to the container via <code>yarn.nodemanager.runtime.linux.runc.default-ro-mounts</code> and <code>yarn.nodemanager.runtime.linux.runc.default-rw-mounts</code> in yarn-site.xml. In this example, <code>yarn.nodemanager.runtime.linux.runc.default-ro-mounts</code> would be set to <code>/sys/fs/cgroup:/sys/fs/cgroup</code>.</p></section><section>
<h2><a name="User_Management_in_runC_Container"></a><a href="#user-management"></a>User Management in runC Container</h2>
<p>YARN’s runC container support launches container processes using the uid:gid identity of the user, as defined on the NodeManager host. User and group name mismatches between the NodeManager host and container can lead to permission issues, failed container launches, or even security holes. Centralizing user and group management for both hosts and containers greatly reduces these risks. When running containerized applications on YARN, it is necessary to understand which uid:gid pair will be used to launch the container’s process.</p>
<p>As an example of what is meant by uid:gid pair, consider the following. By default, in non-secure mode, YARN will launch processes as the user <code>nobody</code> (see the table at the bottom of <a href="./NodeManagerCgroups.html">Using CGroups with YARN</a> for how the run-as user is determined in non-secure mode). On CentOS based systems, the <code>nobody</code> user’s uid is <code>99</code> and the <code>nobody</code> group’s gid is <code>99</code>. As a result, YARN will invoke runC with uid <code>99</code> and gid <code>99</code>. If the <code>nobody</code> user does not have the uid <code>99</code> in the container, the launch may fail or have unexpected results.</p>
<p>There are many ways to address user and group management. runC, by default, will authenticate users against <code>/etc/passwd</code> (and <code>/etc/shadow</code>) within the container. The default <code>/etc/passwd</code> supplied in the runC image is unlikely to contain the appropriate user entries, and relying on it will result in launch failures. It is highly recommended to centralize user and group management. Several approaches to user and group management are outlined below.</p><section>
<h3><a name="Static_user_management"></a>Static user management</h3>
<p>The most basic approach to managing users and groups is to modify the user and group within the runC image. This approach is only viable in non-secure mode where all container processes will be launched as a single known user, for instance <code>nobody</code>. In this case, the only requirement is that the uid:gid pair of the nobody user and group must match between the host and container. On a CentOS based system, this means that the nobody user in the container needs the UID <code>99</code> and the nobody group in the container needs GID <code>99</code>.</p>
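<p>Whether the pair matches can be checked by comparing the relevant <code>/etc/passwd</code> entries from the host and from the image. The sketch below is a hypothetical helper, not part of YARN or runC; it takes the file contents as strings and compares only the UID and primary GID fields of the standard passwd format (<i>name</i>:<i>password</i>:<i>uid</i>:<i>gid</i>:...).</p>

```python
def parse_passwd(text: str) -> dict:
    """Map each user name to its (uid, gid) pair from passwd-format text."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue  # skip malformed lines
        entries[fields[0]] = (int(fields[2]), int(fields[3]))
    return entries


def ids_match(user: str, host_passwd: str, image_passwd: str) -> bool:
    """Return True if user exists in both files with the same uid:gid pair."""
    host = parse_passwd(host_passwd).get(user)
    image = parse_passwd(image_passwd).get(user)
    return host is not None and host == image


# Illustrative entries only; real files will differ.
HOST_PASSWD = "root:x:0:0:root:/root:/bin/bash\nnobody:x:99:99:Nobody:/:/sbin/nologin\n"
IMAGE_PASSWD = "root:x:0:0:root:/root:/bin/bash\nnobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin\n"
```

<p>In this illustrative data the <code>nobody</code> entries differ between host and image, which is exactly the situation that needs correcting inside the image.</p>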
<p>One way to change the UID and GID is to use <code>usermod</code> and <code>groupmod</code>. The following sets the correct UID and GID for the nobody user/group.</p>
<div class="source">
<div class="source">
<pre>usermod -u 99 nobody
groupmod -g 99 nobody
</pre></div></div>
<p>This approach is not recommended beyond testing given the inflexibility to add users.</p></section><section>
<h3><a name="Bind_mounting"></a>Bind mounting</h3>
<p>When organizations already have automation in place to create local users on each system, it may be appropriate to bind mount /etc/passwd and /etc/group into the container as an alternative to modifying the container image directly. To enable the ability to bind mount /etc/passwd and /etc/group, update <code>runc.allowed.ro-mounts</code> in <code>container-executor.cfg</code> to include those paths. For this to work on runC, <code>yarn.nodemanager.runtime.linux.runc.default-ro-mounts</code> will need to include <code>/etc/passwd:/etc/passwd:ro</code> and <code>/etc/group:/etc/group:ro</code>.</p>
<p>There are several challenges with this bind mount approach that need to be considered.</p>
<ol style="list-style-type: decimal">
<li>Any users and groups defined in the image will be overwritten by the host’s users and groups.</li>
<li>No users and groups can be added once the container is started, as /etc/passwd and /etc/group are immutable in the container. Do not mount these read-write, as doing so can render the host inoperable.</li>
</ol>
<p>This approach is not recommended beyond testing given the inflexibility to modify running containers.</p></section><section>
<h3><a name="SSSD"></a>SSSD</h3>
<p>An alternative approach that allows for centrally managing users and groups is SSSD. System Security Services Daemon (SSSD) provides access to different identity and authentication providers, such as LDAP or Active Directory.</p>
<p>The traditional schema for Linux authentication is as follows:</p>
<p>We can bind-mount the UNIX sockets that SSSD communicates over into the container. This allows the SSSD client-side libraries to authenticate against the SSSD running on the host. As a result, user information does not need to exist in /etc/passwd of the Docker image and will instead be serviced by SSSD.</p>
<p>Step by step configuration for host and container:</p>
<ol style="list-style-type: decimal">
<li>Host config</li>
</ol>
<ul>
<li>Install packages
<div class="source">
<div class="source">
<pre># yum -y install sssd-common sssd-proxy
</pre></div></div>
</li>
<li>Create a PAM service for the container.
<div class="source">
<div class="source">
<pre># cat /etc/pam.d/sss_proxy
auth required pam_unix.so
account required pam_unix.so
password required pam_unix.so
session required pam_unix.so
</pre></div></div>
</li>
<li>Create the SSSD config file, /etc/sssd/sssd.conf. Please note that the permissions must be 0600 and the file must be owned by root:root.
<div class="source">
<div class="source">
<pre># cat /etc/sssd/sssd.conf
[sssd]
services = nss,pam
config_file_version = 2
domains = proxy
[nss]
[pam]
[domain/proxy]
id_provider = proxy
proxy_lib_name = files
proxy_pam_target = sss_proxy
</pre></div></div>
</li>
<li>Start SSSD
<div class="source">
<div class="source">
<pre># systemctl start sssd
</pre></div></div>
</li>
<li>Verify that a user can be retrieved with SSSD
<div class="source">
<div class="source">
<pre># getent passwd -s sss localuser
</pre></div></div>
</li>
</ul>
<ol style="list-style-type: decimal" start="2">
<li>Container setup</li>
</ol>
<p>It’s important to bind-mount the /var/lib/sss/pipes directory from the host to the container since SSSD UNIX sockets are located there.</p>
<div class="source">
<div class="source">
<pre>-v /var/lib/sss/pipes:/var/lib/sss/pipes:rw
</pre></div></div>
<ol style="list-style-type: decimal" start="3">
<li>Container config</li>
</ol>
<p>All the steps below should be executed on the container itself.</p>
<ul>
<li>
<p>Install only the sss client libraries</p>
<div class="source">
<div class="source">
<pre># yum -y install sssd-client
</pre></div></div>
</li>
<li>
<p>Make sure sss is configured for the passwd and group databases in</p>
<div class="source">
<div class="source">
<pre>/etc/nsswitch.conf
</pre></div></div>
</li>
<li>
<p>Configure the PAM service that the application uses to call into SSSD</p>
<div class="source">
<div class="source">
<pre># cat /etc/pam.d/system-auth
#%PAM-1.0
# This file is auto-generated.
# User changes will be destroyed the next time authconfig is run.
</pre></div></div>
</li>
</ul>
<p>This example assumes that Hadoop is installed to <code>/usr/local/hadoop</code>.</p>
<p>You will also need to squashify a Docker image and upload it to HDFS before you can run with that image. See <a href="#docker-to-squash">Transforming a Docker Image into a runC Image</a> for instructions on how to transform a Docker image into an image that runC can use. For this example, we will assume that you have done that with an image named <code>hadoop-image</code>.</p>
<p>Additionally, <code>runc.allowed.ro-mounts</code> in <code>container-executor.cfg</code> has been updated to include the directories: <code>/usr/local/hadoop,/etc/passwd,/etc/group</code>.</p>
<p>To submit the pi job to run in runC containers, run the following commands:</p>
<p>Note that the application master, map tasks, and reduce tasks are configured independently. In this example, we are using the <code>hadoop-image</code> image for all three.</p></section><section>
<p>This example assumes that Hadoop is installed to <code>/usr/local/hadoop</code> and Spark is installed to <code>/usr/local/spark</code>.</p>
<p>You will also need to squashify a Docker image and upload it to HDFS before you can run with that image. See <a href="#docker-to-squash">Transforming a Docker Image into a runC Image</a> for instructions on how to transform a Docker image into an image that runC can use. For this example, we will assume that you have done that with an image named <code>hadoop-image</code>.</p>
<p>Additionally, <code>runc.allowed.ro-mounts</code> in <code>container-executor.cfg</code> has been updated to include the directories: <code>/usr/local/hadoop,/etc/passwd,/etc/group</code>.</p>
<p>To run a Spark shell in runC containers, run the following command:</p>
<p>Note that the application master and executors are configured independently. In this example, we are using the <code>hadoop-image</code> image for both.</p></section>