YARN-4653. Document YARN security model from the perspective of Application Developers. Contributed by Steve Loughran

2016-02-14 17:13:15 +08:00 · 2016-02-14 17:13:15 +08:00 · dea90c9a86
commit dea90c9a86
parent ec12ce8f48
3 changed files with 564 additions and 0 deletions
--- a/hadoop-project/src/site/site.xml
+++ b/hadoop-project/src/site/site.xml
@@ -126,6 +126,7 @@
      <item name="Web Application Proxy" href="hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html"/>
      <item name="Timeline Server" href="hadoop-yarn/hadoop-yarn-site/TimelineServer.html"/>
      <item name="Writing YARN Applications" href="hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html"/>
+      <item name="YARN Application Security" href="hadoop-yarn/hadoop-yarn-site/YarnApplicationSecurity.html"/>
      <item name="NodeManager" href="hadoop-yarn/hadoop-yarn-site/NodeManager.html"/>
      <item name="DockerContainerExecutor" href="hadoop-yarn/hadoop-yarn-site/DockerContainerExecutor.html"/>
      <item name="Using CGroups" href="hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html"/>
--- a/hadoop-yarn-project/CHANGES.txt
+++ b/hadoop-yarn-project/CHANGES.txt
@@ -1433,6 +1433,9 @@ Release 2.7.3 - UNRELEASED
    YARN-4492. Add documentation for preemption supported in Capacity
    scheduler (Naganarasimha G R via jlowe)

+    YARN-4653. Document YARN security model from the perspective of
+    Application Developers. (Steve Loughran via jianhe)
+
  OPTIMIZATIONS

  BUG FIXES
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/YarnApplicationSecurity.md
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/YarnApplicationSecurity.md
@@ -0,0 +1,560 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# YARN Application Security
+
+Anyone writing a YARN application needs to understand how YARN security works,
+whether they are writing short-lived applications or long-lived services. They
+also need to start testing on secure clusters during early development stages,
+in order to write code that actually works.
+
+## How YARN Security works
+
+YARN Resource Managers (RMs) and Node Managers (NMs) co-operate to execute
+the user's application with the identity and hence access rights of that user.
+
+The (active) Resource Manager:
+
+1. Finds space in a cluster to deploy the core of the application,
+the Application Master (AM).
+
+1. Requests that the NM on that node allocate a container and start the AM in it.
+
+1. Communicates with the AM, so that the AM can request new containers and
+manipulate/release current ones, and provides notifications about allocated
+and running containers.
+
+The Node Managers:
+
+1. *Localize* resources: download them from HDFS or another filesystem into a
+local directory. This is done using the delegation tokens attached to the
+container launch context. (Non-HDFS resources may be localized using other
+credentials, such as object store login details in cluster configuration files.)
+
+1. Start the application as the user.
+
+1. Monitor the application and report failure to the RM.
+
+To execute code in the cluster, a YARN application must:
+
+1. Have a client-side application which sets up the `ApplicationSubmissionContext`
+detailing what is to be launched. This includes:
+
+    * A list of files in the cluster's filesystem to be "localized".
+    * The environment variables to set in the container.
+    * The commands to execute in the container to start the application.
+    * Any security credentials needed by YARN to launch the application.
+    * Any security credentials needed by the application to interact
+    with any Hadoop cluster services and applications.
+
+1. Have an Application Master which, when launched, registers with
+the YARN RM and listens for events. Any AM which wishes to execute work in
+other containers must request them from the RM and, when they are allocated, create
+a `ContainerLaunchContext` containing the command to execute, the
+environment in which to execute the command, binaries to localize and all relevant
+security credentials.
+
+1. Even with the NM handling the localization process, the AM must itself
+be able to retrieve the security credentials supplied at launch time, so
+that it can work with HDFS and any other services, and can pass some or
+all of these credentials down to the launched containers.
+
+### Acquiring and Adding tokens to a YARN Application
+
+The delegation tokens which a YARN application needs must be acquired
+from a program executing as an authenticated user. For a YARN application,
+this means the user launching the application. It is the client-side part
+of the YARN application which must do this:
+
+1. Log in via `UserGroupInformation`.
+1. Identify all tokens which must be acquired.
+1. Request these tokens from the specific Hadoop services.
+1. Marshal all tokens into a byte buffer.
+1. Add them to the `ContainerLaunchContext` within the `ApplicationSubmissionContext`.
+
+Which tokens are required? Normally, at least a token to access HDFS.
+
+An application must request a delegation token from every filesystem with
+which it intends to interact, including the cluster's main FS.
+`FileSystem.addDelegationTokens(renewer, credentials)` can be used to collect these;
+it is a no-op on those filesystems which do not issue tokens (including
+non-kerberized HDFS clusters).
+
+Applications talking to other services, such as Apache HBase and Apache Hive,
+must request tokens from these services, using the libraries of these
+services to acquire delegation tokens. All tokens can be added to the same
+set of credentials, then saved to a byte buffer for submission.
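+
+The collection and marshalling steps can be modelled in plain Java, so that the flow is
+visible without Hadoop on the classpath. This is a sketch, not Hadoop's API:
+`TokenSource`, `TokenCollector` and the byte layout written by `marshal()` are hypothetical
+stand-ins for `FileSystem.addDelegationTokens()`, the `Credentials` class and its
+`writeTokenStorageToStream()` serialization. A source which issues no tokens simply adds
+nothing, mirroring the no-op behavior on insecure filesystems.
+
+```java
+import java.io.ByteArrayOutputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.nio.ByteBuffer;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+
+/** Hypothetical stand-in for a service which can issue delegation tokens. */
+interface TokenSource {
+  /** Adds any tokens this service issues; services without security add none. */
+  void addTokens(String renewer, Map<String, byte[]> credentials);
+}
+
+final class TokenCollector {
+  /** Asks every source for its tokens, all into one credential set. */
+  static Map<String, byte[]> collect(String renewer, List<TokenSource> sources) {
+    Map<String, byte[]> credentials = new LinkedHashMap<>();
+    for (TokenSource source : sources) {
+      source.addTokens(renewer, credentials);
+    }
+    return credentials;
+  }
+
+  /** Simplified marshalling: a count, then service name and bytes per token. */
+  static ByteBuffer marshal(Map<String, byte[]> credentials) {
+    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
+    try (DataOutputStream out = new DataOutputStream(bytes)) {
+      out.writeInt(credentials.size());
+      for (Map.Entry<String, byte[]> entry : credentials.entrySet()) {
+        out.writeUTF(entry.getKey());
+        out.writeInt(entry.getValue().length);
+        out.write(entry.getValue());
+      }
+    } catch (IOException e) {
+      throw new UncheckedIOException(e);  // not expected for an in-memory stream
+    }
+    return ByteBuffer.wrap(bytes.toByteArray());
+  }
+}
+```
+
+In a real client, the resulting buffer is what gets passed to
+`ContainerLaunchContext.setTokens()`, with the RM principal as the renewer.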
+
+The Application Timeline Server also needs a delegation token. This is handled
+automatically on AM launch.
+
+### Extracting tokens within the AM
+
+When the Application Master is launched and any of the UGI/Hadoop operations
+which trigger a user login is invoked, the UGI class will automatically load all tokens
+saved in the file named by the environment variable `HADOOP_TOKEN_FILE_LOCATION`.
+
+This happens on an insecure cluster as well as on a secure one, and on a secure
+cluster even if a keytab is used by the application. Why? Because the
+AM/RM token needed to authenticate the application with the YARN RM is always
+supplied this way.
+
+This means you have a relatively similar workflow across secure and insecure clusters.
+
+1. During AM startup, log in to Kerberos.
+A call to `UserGroupInformation.isSecurityEnabled()` will trigger this operation.
+
+1. Enumerate the current user's credentials, through a call of
+`UserGroupInformation.getCurrentUser().getCredentials()`.
+
+1. Filter out the AMRM token, resulting in a new set of credentials. In an
+insecure cluster, the list of credentials will now be empty; in a secure cluster,
+it will contain the delegation tokens which the client added to the AM's
+launch context.
+
+1. Set the credentials of all containers to be launched to this (possibly empty)
+list of credentials.
+
+1. If the filtered list of tokens to renew is non-empty, start a thread
+to renew them.
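+
+The filtering step and the renewal decision can be sketched as follows. `Tok` is a
+hypothetical stand-in for Hadoop's `Token`, holding only a kind and a service. The kind
+strings `YARN_AM_RM_TOKEN` and `TIMELINE_DELEGATION_TOKEN` are assumed here to be the
+kinds Hadoop uses for the AM/RM and timeline tokens. Real code should compare against the
+`KIND_NAME` constants of the corresponding token identifier classes rather than hard-coded
+strings. The sketch also removes any timeline token, as the AM checklist recommends.
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Set;
+
+/** Hypothetical stand-in for org.apache.hadoop.security.token.Token. */
+final class Tok {
+  final String kind;
+  final String service;
+
+  Tok(String kind, String service) {
+    this.kind = kind;
+    this.service = service;
+  }
+}
+
+final class AmTokenFilter {
+  /** Kinds which YARN refreshes itself (assumed names; verify per version). */
+  static final Set<String> YARN_MANAGED =
+      Set.of("YARN_AM_RM_TOKEN", "TIMELINE_DELEGATION_TOKEN");
+
+  /** Returns a new list without the YARN-managed tokens; the input is untouched. */
+  static List<Tok> forContainers(List<Tok> amTokens) {
+    List<Tok> result = new ArrayList<>();
+    for (Tok token : amTokens) {
+      if (!YARN_MANAGED.contains(token.kind)) {
+        result.add(token);
+      }
+    }
+    return result;
+  }
+
+  /** A renewal thread is only worth starting if anything is left to renew. */
+  static boolean needsRenewalThread(List<Tok> filtered) {
+    return !filtered.isEmpty();
+  }
+}
+```
+
+In an AM, the input would come from
+`UserGroupInformation.getCurrentUser().getCredentials().getAllTokens()`.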
+
+### Token Renewal
+
+Tokens *expire*: they have a limited lifespan. An application wishing to
+use a token past this expiry date must *renew* the token before the token
+expires.
+
+Hadoop automatically sets up a delegation token renewal thread when needed,
+the `DelegationTokenRenewer`.
+
+It is the responsibility of the application to renew all tokens other
+than the AMRM and timeline tokens.
+
+Here are the different strategies:
+
+1. Don't. Rely on the lifespan of the application being so short that token
+renewal is not needed. For applications whose life can always be measured
+in minutes or tens of minutes, this is a viable strategy.
+
+1. Start a background thread/Executor to renew the tokens at a regular interval.
+This is what most YARN applications do.
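+
+The second strategy can be sketched with a `ScheduledExecutorService`. `RenewableToken` is a
+hypothetical interface standing in for Hadoop's `Token.renew(Configuration)`, which returns
+the new expiry time in milliseconds. The sketch renews once 75% of the remaining lifetime
+has passed and never schedules sooner than one second. Both values are illustrative
+choices rather than Hadoop defaults.
+
+```java
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.function.LongSupplier;
+
+/** Hypothetical stand-in for a renewable token; renew() returns the new expiry (ms). */
+interface RenewableToken {
+  long renew() throws Exception;
+}
+
+final class TokenRenewalScheduler {
+  /** Illustrative choice: renew once 75% of the remaining lifetime has passed. */
+  static final double RENEW_FRACTION = 0.75;
+  /** Floor on the delay, so a token near its end is not renewed in a tight loop. */
+  static final long MIN_DELAY_MILLIS = 1_000L;
+
+  private final ScheduledExecutorService executor =
+      Executors.newSingleThreadScheduledExecutor();
+  private final LongSupplier clock;
+
+  TokenRenewalScheduler(LongSupplier clock) {
+    this.clock = clock;
+  }
+
+  static long renewalDelayMillis(long nowMillis, long expiryMillis) {
+    long remaining = Math.max(0L, expiryMillis - nowMillis);
+    return Math.max(MIN_DELAY_MILLIS, (long) (remaining * RENEW_FRACTION));
+  }
+
+  /** Schedules a renewal; each successful renewal schedules the next one. */
+  void scheduleRenewal(RenewableToken token, long expiryMillis) {
+    long delay = renewalDelayMillis(clock.getAsLong(), expiryMillis);
+    executor.schedule(() -> {
+      try {
+        scheduleRenewal(token, token.renew());
+      } catch (Exception e) {
+        // Typically the token has reached its maximum lifetime: stop renewing it.
+        System.err.println("Token renewal failed: " + e);
+      }
+    }, delay, TimeUnit.MILLISECONDS);
+  }
+
+  /** The executor thread is not a daemon, so call this on shutdown. */
+  void stop() {
+    executor.shutdownNow();
+  }
+}
+```
+
+Once a token reaches its maximum lifetime, `renew()` fails and the chain of renewals stops.
+That hard limit is why long-lived services need one of the strategies described later.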
+
+## Other Aspects of YARN Security
+
+
+### AM/RM Token Refresh
+
+The AM/RM token is renewed automatically; the RM pushes out a new token
+to the AM within an `allocate` message. Consult the `AMRMClientImpl` class
+to see the process. *Your AM code does not need to worry about this process.*
+
+### Token Renewal on AM Restart
+
+Even if an application is renewing tokens regularly, if an AM fails and is
+restarted, it gets restarted from the original
+`ApplicationSubmissionContext`. The tokens there may have expired, so localization
+may fail, even before any credentials for talking to other services are needed.
+
+How is this problem addressed? The YARN Resource Manager gets a new token
+for the node managers, if needed.
+
+More precisely:
+
+1. The token passed by the RM to the NM for localization is refreshed/updated as needed.
+1. Tokens in the app launch context for use by the application are *not* refreshed.
+That is, if it has an out-of-date HDFS token, that token is not renewed. This
+also holds for tokens for Hive, HBase, etc.
+1. Therefore, to survive AM restart after token expiry, your AM has to get the
+NMs to localize the keytab, or make no HDFS accesses until (somehow) a new token
+has been passed to it from a client.
+
+This is primarily an issue for long-lived services (see below).
+
+### Unmanaged Application Masters
+
+Unmanaged application masters are not launched in a container set up by
+the RM and NM, so they cannot automatically pick up an AM/RM token at launch time.
+The `YarnClient.getAMRMToken()` API permits an Unmanaged AM to request an AM/RM
+token. Consult `UnmanagedAMLauncher` for the specifics.
+
+### Identity on an insecure cluster: `HADOOP_USER_NAME`
+
+In an insecure cluster, the application will run as the identity of
+the account of the node manager, typically something such as `yarn`
+or `mapred`. By default, the application will access HDFS
+as that user, with a different home directory, and with
+a different user identified in audit logs and on file system owner attributes.
+
+This can be avoided by having the client identify the
+HDFS/Hadoop user under which the application is expected to run. *This
+does not affect the OS-level user or the application's access rights
+to the local machine*.
+
+When Kerberos is disabled, the identity of a user is picked up
+by Hadoop first from the environment variable `HADOOP_USER_NAME`,
+then from the OS-level username (e.g. the system property `user.name`).
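+
+The lookup order just described can be written as a small pure function. This is a
+simplified model for illustration, not UGI's actual login code, which has additional
+steps. `InsecureIdentity` is a hypothetical name. The model only shows why setting
+`HADOOP_USER_NAME` overrides the OS-level account.
+
+```java
+import java.util.Map;
+import java.util.Properties;
+
+final class InsecureIdentity {
+  static final String HADOOP_USER_NAME = "HADOOP_USER_NAME";
+
+  /** Simplified model: the environment variable first, then the OS-level user name. */
+  static String resolveUser(Map<String, String> env, Properties systemProps) {
+    String fromEnv = env.get(HADOOP_USER_NAME);
+    if (fromEnv != null) {
+      return fromEnv;
+    }
+    return systemProps.getProperty("user.name");
+  }
+}
+```
+
+Consider a container started by the `yarn` account with `HADOOP_USER_NAME=alice` in its
+environment. The model resolves to `alice`, which is the identity HDFS would then see.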
+
+YARN applications should propagate the user name of the user launching
+an application by setting this environment variable.
+
+```java
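+import java.util.HashMap;
+import java.util.Map;
+import org.apache.hadoop.security.UserGroupInformation;
+// Propagate the submitting user's name as the container's Hadoop identity
+Map<String, String> env = new HashMap<>();
+String userName = UserGroupInformation.getCurrentUser().getUserName();
+env.put(UserGroupInformation.HADOOP_USER_NAME, userName);
+containerLaunchContext.setEnvironment(env);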
+```
+
+Note that this environment variable is picked up by all applications
+which talk to HDFS via the Hadoop libraries. That is, if it is set in the
+environment of a YARN container, it is the identity used by HBase and any
+other applications executed within that container.
+
+### Oozie integration and `HADOOP_TOKEN_FILE_LOCATION`
+
+Apache Oozie can launch an application in a secure cluster by acquiring
+all relevant credentials, saving them to a file in the local filesystem,
+then setting the path to this file in the environment variable
+`HADOOP_TOKEN_FILE_LOCATION`. This is of course the same environment variable
+passed down by YARN in launched containers, with similar content: a byte
+array with credentials.
+
+Here, however, the environment variable is set in the environment
+executing the YARN client. This client must use the token information saved
+in the named file *instead of acquiring any tokens of its own*.
+
+Loading in the token file is automatic: UGI does it during user login.
+
+The client is then responsible for passing the same credentials into the
+AM launch context. This can be done simply by passing down the current
+credentials.
+
+```java
+// Copy the credentials which UGI loaded from $HADOOP_TOKEN_FILE_LOCATION
+Credentials credentials = new Credentials(
+    UserGroupInformation.getCurrentUser().getCredentials());
+```
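+
+The client-side choice between acquiring tokens and reusing those supplied by a launcher
+such as Oozie can be sketched as below. `ClientTokenPlanner` and `TokenPlan` are
+hypothetical names. In a real client, the reuse branch corresponds to the `Credentials`
+copy above, and the acquire branch to requesting tokens from each service.
+
+```java
+import java.util.Map;
+
+final class ClientTokenPlanner {
+  static final String TOKEN_FILE_ENV = "HADOOP_TOKEN_FILE_LOCATION";
+
+  /** Hypothetical: what the client should do about delegation tokens. */
+  enum TokenPlan { NONE, REUSE_LAUNCHER_TOKENS, ACQUIRE_OWN_TOKENS }
+
+  static TokenPlan plan(boolean securityEnabled, Map<String, String> env) {
+    if (!securityEnabled) {
+      return TokenPlan.NONE;                   // insecure: no tokens to acquire
+    }
+    if (env.get(TOKEN_FILE_ENV) != null) {
+      return TokenPlan.REUSE_LAUNCHER_TOKENS;  // e.g. launched by Oozie
+    }
+    return TokenPlan.ACQUIRE_OWN_TOKENS;
+  }
+}
+```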
+
+### Timeline Server integration
+
+The [Application Timeline Server](TimelineServer.html) can be deployed as a secure service,
+in which case the application will need the relevant token to authenticate with
+it. This process is handled automatically in `YarnClientImpl` if ATS is
+enabled in a secure cluster. Similarly, the AM-side `TimelineClient` YARN service
+class manages token renewal automatically via the ATS's SPNEGO-authenticated REST API.
+
+If you need to prepare a set of delegation tokens for a YARN application launch
+via Oozie, this can be done via the timeline client API.
+
+```java
+try(TimelineClient timelineClient = TimelineClient.createTimelineClient()) {
+  timelineClient.init(conf);
+  timelineClient.start();
+  Token<TimelineDelegationTokenIdentifier> token =
+      timelineClient.getDelegationToken(rmprincipal);
+  credentials.addToken(token.getService(), token);
+}
+```
+
+### Cancelling Tokens
+
+Applications *may* wish to cancel tokens they hold when terminating their AM.
+This ensures that the tokens are no longer valid.
+
+This is not mandatory, and as a clean shutdown of a YARN application cannot
+be guaranteed, it is not possible to guarantee that the tokens will always
+be cancelled during application termination. However, it does reduce the window of
+vulnerability to stolen tokens.
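+
+A best-effort cancellation pass can be sketched as follows. `CancellableToken` is a
+hypothetical stand-in for Hadoop's `Token.cancel(Configuration)`. The design point is that
+one failed cancellation must not stop the others. Because the AM may be killed before this
+code runs, nothing should depend on it having run.
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+
+/** Hypothetical stand-in for a token which can be cancelled. */
+interface CancellableToken {
+  String service();
+  void cancel() throws Exception;
+}
+
+final class TokenCanceller {
+  /** Cancels every token, collecting failures instead of stopping at the first. */
+  static List<String> cancelAll(List<CancellableToken> tokens) {
+    List<String> failures = new ArrayList<>();
+    for (CancellableToken token : tokens) {
+      try {
+        token.cancel();
+      } catch (Exception e) {
+        failures.add(token.service() + ": " + e.getMessage());
+      }
+    }
+    return failures;
+  }
+}
+```
+
+An AM would call `cancelAll()` only once it no longer needs the tokens, for example after
+its final HDFS writes, and would log any failures it returns.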
+
+## Securing Long-lived YARN Services
+
+There is a time limit on all token renewals, after which tokens won't renew,
+causing the application to stop working. This is somewhere between seventy-two
+hours and seven days.
+
+Any YARN service intended to run for an extended period of time *must* have
+a strategy for renewing credentials.
+
+Here are the strategies:
+
+### Pre-installed Keytabs for AM and containers
+
+A keytab is provided for the application's use on every node.
+
+This is done by:
+
+1. Installing it in every cluster node's local filesystem.
+1. Providing the path to this in a configuration option.
+1. The application loading the credentials via
+  `UserGroupInformation.loginUserFromKeytab()`.
+
+The keytab must be in a secure directory path, where
+only the service (and other trusted accounts) can read it. Distribution
+becomes a responsibility of the cluster operations team.
+
+This is effectively how all static Hadoop applications get their security credentials.
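+
+A check that a pre-installed keytab is readable only by its owner can be sketched with
+`java.nio.file`. This assumes a POSIX local filesystem and checks only the file's own
+permission bits; the enclosing directories need checking too. `KeytabCheck` is a
+hypothetical name.
+
+```java
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.attribute.PosixFilePermission;
+import java.util.Collections;
+import java.util.EnumSet;
+import java.util.Set;
+
+final class KeytabCheck {
+  /** Permission bits which would let anyone but the owner touch the keytab. */
+  private static final Set<PosixFilePermission> GROUP_OR_OTHER = EnumSet.of(
+      PosixFilePermission.GROUP_READ, PosixFilePermission.GROUP_WRITE,
+      PosixFilePermission.GROUP_EXECUTE, PosixFilePermission.OTHERS_READ,
+      PosixFilePermission.OTHERS_WRITE, PosixFilePermission.OTHERS_EXECUTE);
+
+  /** True if the keytab file grants no permissions to group or others. */
+  static boolean isOwnerOnly(Path keytab) {
+    try {
+      Set<PosixFilePermission> perms = Files.getPosixFilePermissions(keytab);
+      return Collections.disjoint(perms, GROUP_OR_OTHER);
+    } catch (IOException e) {
+      throw new UncheckedIOException(e);
+    }
+  }
+}
+```
+
+A service would typically refuse to start, or at least warn, when `isOwnerOnly()` returns
+false, before calling `UserGroupInformation.loginUserFromKeytab()`.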
+
+### Keytabs for AM and containers distributed via YARN
+
+
+1. A keytab is uploaded to HDFS.
+
+1. When launching the AM, the keytab is listed as a resource to localize to
+the AM's container.
+
+1. The Application Master is configured with the relative path to the keytab,
+and logs in with `UserGroupInformation.loginUserFromKeytab()`.
+
+1. When the AM launches the container, it lists the HDFS path to the keytab
+as a resource to localize.
+
+1. It adds the HDFS delegation token to the container launch context, so
+that the keytab and other application files can be localized.
+
+1. Launched containers must themselves log in via
+  `UserGroupInformation.loginUserFromKeytab()`. UGI handles the login, and
+  schedules a background thread to relogin the user periodically.
+
+1. Token creation is handled automatically in the Hadoop IPC and REST APIs;
+the containers stay logged in via Kerberos for their entire duration.
+
+This avoids the administration task of installing keytabs for specific services
+across the entire cluster.
+
+It does require the client to have access to the keytab, and, as the keytab
+is uploaded to the distributed filesystem, it must be secured through
+the appropriate path permissions/ACLs.
+
+As all containers have access to the keytab, all code executing in the containers
+has to be trusted. Malicious code (or code escaping some form of sandbox)
+could read the keytab, and hence have access to the cluster until the keys
+expire or are revoked.
+
+This is the strategy implemented by Apache Slider (incubating).
+
+### AM keytab distributed via YARN; AM regenerates delegation tokens for containers
+
+1. A keytab is uploaded to HDFS by the client.
+
+1. When launching the AM, the keytab is listed as a resource to localize to
+the AM's container.
+
+1. The Application Master is configured with the relative path to the keytab,
+and logs in with `UserGroupInformation.loginUserFromKeytab()`. The UGI
+codepath will still automatically load the file referenced by
+`$HADOOP_TOKEN_FILE_LOCATION`, which is how the AMRM token is picked up.
+
+1. When the AM launches a container, it acquires all the delegation tokens
+needed by that container, and adds them to the container's container launch context.
+
+1. Launched containers must load the delegation tokens from `$HADOOP_TOKEN_FILE_LOCATION`,
+and use them (including renewals) until they can no longer be renewed.
+
+1. The AM must implement an IPC interface which permits containers to request
+a new set of delegation tokens; this interface must itself use authentication
+and ideally wire encryption.
+
+1. Before a delegation token is due to expire, the processes running in the containers
+must request new tokens from the Application Master over the IPC channel.
+
+1. When the containers need the new tokens, the AM, logged in with a keytab,
+ asks the various cluster services for new tokens.
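+
+The container side of these request/refresh steps can be sketched as below. `AmTokenService`
+stands in for whatever authenticated IPC interface the application defines between the
+containers and the AM. `TokenSet` is a hypothetical holder for the serialized tokens plus
+their earliest expiry time. Both types, and the refresh margin, are assumptions of this sketch.
+
+```java
+import java.util.concurrent.atomic.AtomicReference;
+
+/** Hypothetical: tokens handed out by the AM, with their earliest expiry. */
+final class TokenSet {
+  final byte[] tokens;
+  final long earliestExpiryMillis;
+
+  TokenSet(byte[] tokens, long earliestExpiryMillis) {
+    this.tokens = tokens;
+    this.earliestExpiryMillis = earliestExpiryMillis;
+  }
+}
+
+/** Hypothetical authenticated IPC interface implemented by the AM. */
+interface AmTokenService {
+  TokenSet requestNewTokens(String containerId);
+}
+
+final class ContainerTokenRefresher {
+  private final AmTokenService am;
+  private final String containerId;
+  private final long marginMillis;
+  private final AtomicReference<TokenSet> current;
+
+  ContainerTokenRefresher(AmTokenService am, String containerId,
+      long marginMillis, TokenSet initial) {
+    this.am = am;
+    this.containerId = containerId;
+    this.marginMillis = marginMillis;
+    this.current = new AtomicReference<>(initial);
+  }
+
+  /** Asks the AM for fresh tokens once the earliest expiry is within the margin. */
+  boolean refreshIfNeeded(long nowMillis) {
+    TokenSet tokens = current.get();
+    if (tokens.earliestExpiryMillis - nowMillis > marginMillis) {
+      return false;
+    }
+    current.set(am.requestNewTokens(containerId));
+    return true;
+  }
+
+  TokenSet current() {
+    return current.get();
+  }
+}
+```
+
+A real container would then add the new tokens to its current user, for example with
+`UserGroupInformation.getCurrentUser().addCredentials()`. It would call
+`refreshIfNeeded()` periodically from a scheduled executor.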
+
+(There is an alternative direction for refresh operations: from the AM
+to the containers, again over whatever IPC channel is implemented between
+the AM and the containers. In this variant, the AM regenerates the tokens
+and pushes them to the containers over IPC.)
+
+This is the strategy used by Apache Spark 1.5+, with a netty-based protocol
+between containers and the AM for token updates.
+
+Because only the AM has direct access to the keytab, it is less exposed.
+Code running in the containers only has access to the delegation tokens.
+
+However, those containers will have access to HDFS from the tokens
+passed in at container launch, so will have access to the copy of the keytab
+used for launching the AM. While the AM could delete that keytab on launch,
+doing so would prevent YARN from successfully relaunching the AM after any
+failure.
+
+### Client-side Token Push
+
+This strategy may be the sole one acceptable to a strict operations team: a client process
+running on an account holding a Kerberos TGT negotiates with all needed cluster services
+for new delegation tokens, tokens which are then pushed out to the Application Master via
+some RPC interface.
+
+This does require the client process to be re-executed on a regular basis; a cron or Oozie job
+can do this. The AM will need to implement an IPC API over which renewed
+tokens can be provided. (Note that as Oozie can collect the tokens itself,
+all the updater application needs to do each time it runs is set up an IPC
+connection with the AM and pass up the current user's credentials.)
+
+## Securing YARN Application Web UIs and REST APIs
+
+YARN provides a straightforward way of giving every YARN application SPNEGO authenticated
+web pages: it implements SPNEGO authentication in the Resource Manager Proxy.
+
+YARN application web UIs are expected to load the AM proxy filter when setting up
+their web UI; this filter will redirect all HTTP requests coming from any host other
+than the RM Proxy hosts to an RM proxy, to which the client app/browser must re-issue
+the request. The client will authenticate against the principal of the RM Proxy
+(usually `yarn`), and, once authenticated, have its request forwarded.
+
+As a result, all client interactions are SPNEGO-authenticated, without the YARN application
+itself needing any Kerberos principal for the clients to authenticate against.
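+
+The redirect decision described above can be modelled as below. This is a simplified
+sketch, not the proxy filter's actual code. `ProxyRedirect` is a hypothetical name, proxy
+hosts are matched by remote address only, and the redirect target is assumed to be a
+proxy base URL followed by the original request path.
+
+```java
+import java.util.Optional;
+import java.util.Set;
+
+final class ProxyRedirect {
+  private final Set<String> proxyHosts;
+  private final String proxyBaseUrl;
+
+  ProxyRedirect(Set<String> proxyHosts, String proxyBaseUrl) {
+    this.proxyHosts = proxyHosts;
+    this.proxyBaseUrl = proxyBaseUrl;
+  }
+
+  /**
+   * Empty if the request came from an RM proxy host and may be served;
+   * otherwise the URL against which the client must re-issue its request.
+   */
+  Optional<String> redirectFor(String remoteAddr, String requestPath) {
+    if (proxyHosts.contains(remoteAddr)) {
+      return Optional.empty();
+    }
+    return Optional.of(proxyBaseUrl + requestPath);
+  }
+}
+```
+
+The model also makes the first known weakness visible: any request whose remote address
+is a proxy host is served without challenge.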
+
+Known weaknesses in this approach are:
+
+1. As calls coming from the proxy hosts are not redirected, any application running
+on those hosts has unrestricted access to the YARN applications. This is why in a secure cluster
+the proxy hosts *must* run on cluster nodes which do not run end user code (i.e. not run YARN
+NodeManagers and hence schedule YARN containers, nor support logins by end users).
+
+1. The HTTP requests between proxy and YARN RM Server are not currently encrypted.
+That is: HTTPS is not supported.
+
+## Securing YARN Application REST APIs
+
+YARN REST APIs running on the same port as the registered web UI of a YARN application are
+automatically authenticated via SPNEGO authentication in the RM proxy.
+
+Any REST endpoint (and equally, any web UI) brought up on a different port does not
+support SPNEGO authentication unless implemented in the YARN application itself.
+
+## Checklist for YARN Applications
+
+Here is the checklist of core actions which a YARN application must do
+to successfully launch in a YARN cluster.
+
+### Client
+
+`[ ]` Client checks for security being enabled via `UserGroupInformation.isSecurityEnabled()`
+
+In a secure cluster:
+
+`[ ]` If `HADOOP_TOKEN_FILE_LOCATION` is unset, client acquires delegation tokens
+ for the local filesystems, with the RM principal set as the renewer.
+
+`[ ]` If `HADOOP_TOKEN_FILE_LOCATION` is unset, client acquires delegation tokens
+for all other services to be used in the YARN application.
+
+`[ ]` If `HADOOP_TOKEN_FILE_LOCATION` is set, client uses the current user's credentials
+as the source of all tokens to be added to the container launch context.
+
+`[ ]` Client sets all tokens on AM `ContainerLaunchContext.setTokens()`.
+
+`[ ]` Recommended: if it is set in the client's environment,
+client sets the environment variable `HADOOP_JAAS_DEBUG=true`
+in the Container launch context of the AM.
+
+In an insecure cluster:
+
+`[ ]` Propagate local username to YARN AM, hence HDFS identity via the
+`HADOOP_USER_NAME` environment variable.
+
+### App Master
+
+`[ ]` In a secure cluster, AM retrieves security tokens from `HADOOP_TOKEN_FILE_LOCATION`
+environment variable (automatically done by UGI).
+
+`[ ]` A copy of the token set is filtered to remove the AM/RM token and any timeline
+token.
+
+`[ ]` A thread or executor is started to renew tokens on a regular basis.
+
+`[ ]` Recommended: AM cancels tokens when application completes.
+
+### Container Launch by AM
+
+`[ ]` Tokens to be passed to containers are passed via
+`ContainerLaunchContext.setTokens()`.
+
+`[ ]` In an insecure cluster, propagate the `HADOOP_USER_NAME` environment variable.
+
+`[ ]` Recommended: AM sets the environment variable `HADOOP_JAAS_DEBUG=true`
+in the Container launch context if it is set in the AM's environment.
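+
+The environment-related items in this checklist can be combined into one helper, sketched
+below. `LaunchEnvironment` is a hypothetical name. Its result is meant to be merged into
+the map passed to `ContainerLaunchContext.setEnvironment()`.
+
+```java
+import java.util.HashMap;
+import java.util.Map;
+
+final class LaunchEnvironment {
+  /** Builds the security-related environment entries for a container launch. */
+  static Map<String, String> securityEnv(Map<String, String> parentEnv,
+      boolean securityEnabled, String userName) {
+    Map<String, String> env = new HashMap<>();
+    if (!securityEnabled) {
+      env.put("HADOOP_USER_NAME", userName);  // insecure: propagate the identity
+    }
+    if (parentEnv.containsKey("HADOOP_JAAS_DEBUG")) {
+      env.put("HADOOP_JAAS_DEBUG", "true");   // pass on JAAS/Kerberos debugging
+    }
+    return env;
+  }
+}
+```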
+
+### Launched Containers
+
+`[ ]` Call `UserGroupInformation.isSecurityEnabled()` to trigger security setup.
+
+`[ ]` A thread or executor is started to renew tokens on a regular basis.
+
+### YARN service
+
+`[ ]` Application developers have chosen and implemented a token renewal strategy:
+shared keytab, AM keytab or client-side token refresh.
+
+`[ ]` In a secure cluster, the keytab is either already in HDFS (and checked for),
+or it is in the local FS of the client, in which case it must be uploaded and added to
+the list of resources to localize.
+
+`[ ]` If stored in HDFS, keytab permissions should be checked. If the keytab
+is readable by principals other than the current user, warn,
+and consider actually failing the launch (similar to the behavior of `ssh`).
+
+`[ ]` Client acquires HDFS delegation token and attaches it to the AM Container
+Launch Context.
+
+`[ ]` AM logs in as principal in keytab via `loginUserFromKeytab()`.
+
+`[ ]` (AM extracts AM/RM token from the `HADOOP_TOKEN_FILE_LOCATION` environment
+variable).
+
+`[ ]` For launched containers, either the keytab is propagated, or
+the AM acquires/attaches all required delegation tokens to the Container Launch
+context alongside the HDFS delegation token needed by the NMs.
+
+## Testing YARN applications in a secure cluster
+
+There is a straightforward way to be confident that a YARN application works in a secure
+cluster: test it on a secure cluster.
+
+Even a single-VM cluster can be set up with security enabled. If doing so,
+we recommend turning security up to its strictest, with SPNEGO-authenticated
+Web UIs (and hence RM Proxy), as well as IPC wire encryption. Setting the
+Kerberos token expiry to under an hour will find Kerberos expiry problems
+early, so this is also recommended.
+
+`[ ]` Application launched in secure cluster.
+
+`[ ]` Launched application runs as the user who submitted it (tip: log the `user.name`
+system property in the AM).
+
+`[ ]` Web browser interaction verified in secure cluster.
+
+`[ ]` REST client interaction (GET operations) tested.
+
+`[ ]` Application continues to run after Kerberos Token expiry.
+
+`[ ]` Application does not launch if user lacks Kerberos credentials.
+
+`[ ]` If the application supports the timeline server, verify that it publishes
+events in a secure cluster.
+
+`[ ]` If the application integrates with other applications, such as HBase or Hive,
+verify that the interaction works in a secure cluster.
+
+`[ ]` If the application communicates with remote HDFS clusters, verify
+that it can do so in a secure cluster (i.e. that the client extracted any
+delegation tokens for this at launch time).
+
+## Important
+
+*If you don't test your YARN application in a secure Hadoop cluster,
+it won't work.*
+
+And without those tests: *your users will be the ones to find out
+that your application doesn't work in a secure cluster.*
+
+Bear that in mind when considering how much development effort to put into
+Kerberos support.