From 68d98f70baa18cc450021619527b93a98a52371b Mon Sep 17 00:00:00 2001 From: Jian He Date: Sun, 14 Feb 2016 17:13:15 +0800 Subject: [PATCH] YARN-4653. Document YARN security model from the perspective of Application Developers. Contributed by Steve Loughran --- hadoop-project/src/site/site.xml | 1 + hadoop-yarn-project/CHANGES.txt | 3 + .../site/markdown/YarnApplicationSecurity.md | 560 ++++++++++++++++++ 3 files changed, 564 insertions(+) create mode 100644 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/YarnApplicationSecurity.md diff --git a/hadoop-project/src/site/site.xml b/hadoop-project/src/site/site.xml index d4fde5667fe..a22e94cab84 100644 --- a/hadoop-project/src/site/site.xml +++ b/hadoop-project/src/site/site.xml @@ -124,6 +124,7 @@ + diff --git a/hadoop-yarn-project/CHANGES.txt b/hadoop-yarn-project/CHANGES.txt index ecf1efc666b..aaf4b56513e 100644 --- a/hadoop-yarn-project/CHANGES.txt +++ b/hadoop-yarn-project/CHANGES.txt @@ -1378,6 +1378,9 @@ Release 2.7.3 - UNRELEASED YARN-4492. Add documentation for preemption supported in Capacity scheduler (Naganarasimha G R via jlowe) + YARN-4653. Document YARN security model from the perspective of + Application Developers. (Steve Loughran via jianhe) + OPTIMIZATIONS BUG FIXES diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/YarnApplicationSecurity.md b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/YarnApplicationSecurity.md new file mode 100644 index 00000000000..640139f9831 --- /dev/null +++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/YarnApplicationSecurity.md @@ -0,0 +1,560 @@ + + +# YARN Application Security + +Anyone writing a YARN application needs to understand the process, in order +to write short-lived applications or long-lived services. They also need to +start testing on secure clusters during early development stages, in order +to write code that actually works. + +## How YARN Security works + +YARN Resource Managers (RMs) and Node Managers (NMs) co-operate to execute +the user's application with the identity and hence access rights of that user. + +The (active) Resource Manager: + +1. Finds space in a cluster to deploy the core of the application, +the Application Master (AM). + +1. Requests that the NM on that node allocate a container and start the AM in it. + +1. Communicates with the AM, so that the AM can request new containers and +manipulate/release current ones, and to provide notifications about allocated +and running containers. + +The Node Managers: + +1. *Localize* resources: Download from HDFS or other filesystem into a local directory. This +is done using the delegation tokens attached to the container launch context. (For non-HDFS +resources, using other credentials such as object store login details in cluster configuration +files) + +1. Start the application as the user. + +1. Monitor the application and report failure to the RM. + +To execute code in the cluster, a YARN application must: + +1. Have a client-side application which sets up the `ApplicationSubmissionContext` +detailing what is to be launched. This includes: + + * A list of files in the cluster's filesystem to be "localized". + * The environment variables to set in the container. + * The commands to execute in the container to start the application. + * Any security credentials needed by YARN to launch the application. + * Any security credentials needed by the application to interact + with any Hadoop cluster services and applications. + +1. 
Have an Application Master which, when launched, registers with
the YARN RM and listens for events. Any AM which wishes to execute work in
other containers must request them off the RM, and, when allocated, create
a `ContainerLaunchContext` containing the command to execute, the
environment in which to execute the command, binaries to localize and all relevant
security credentials.

1. Even with the NM handling the localization process, the AM must itself
be able to retrieve the security credentials supplied at launch time so
that it itself may work with HDFS and any other services, and to pass some or
all of these credentials down to the launched containers.

### Acquiring and Adding tokens to a YARN Application

The delegation tokens which a YARN application needs must be acquired
from a program executing as an authenticated user. For a YARN application,
this means the user launching the application. It is the client-side part
of the YARN application which must do this:

1. Log in via `UserGroupInformation`.
1. Identify all tokens which must be acquired.
1. Request these tokens from the specific Hadoop services.
1. Marshall all tokens into a byte buffer.
1. Add them to the `ContainerLaunchContext` within the `ApplicationSubmissionContext`.

Which tokens are required? Normally, at least a token to access HDFS.

An application must request a delegation token from every filesystem with
which it intends to interact —including the cluster's main FS.
`FileSystem.addDelegationTokens(renewer, credentials)` can be used to collect these;
it is a no-op on those filesystems which do not issue tokens (including
non-kerberized HDFS clusters).

Applications talking to other services, such as Apache HBase and Apache Hive,
must request tokens from these services, using the libraries of these
services to acquire delegation tokens. All tokens can be added to the same
set of credentials, then saved to a byte buffer for submission.

The Application Timeline Server also needs a delegation token. This is handled
automatically on AM launch.

### Extracting tokens within the AM

When the Application Master is launched and any of the UGI/Hadoop operations
which trigger a user login are invoked, the UGI class will automatically load in all tokens
saved in the file named by the environment variable `HADOOP_TOKEN_FILE_LOCATION`.

This happens on an insecure cluster as well as a secure one, and on a secure
cluster even if a keytab is used by the application. Why? Because the
AM/RM token needed to authenticate the application with the YARN RM is always
supplied this way.

This means you have a relatively similar workflow across secure and insecure clusters.

1. During AM startup, log in to Kerberos.
A call to `UserGroupInformation.isSecurityEnabled()` will trigger this operation.

1. Enumerate the current user's credentials, through a call to
`UserGroupInformation.getCurrentUser().getCredentials()`.

1. Filter out the AMRM token, resulting in a new set of credentials. In an
insecure cluster the list of credentials will now be empty; in a secure cluster
it will contain the remaining tokens supplied by the client, such as the
HDFS delegation token (see the sketch below).

1. Set the credentials of all containers to be launched to this (possibly empty)
list of credentials.

1. If the filtered list of tokens to renew is non-empty, start up a thread
to renew them.
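Taken together, steps 2 to 4 look something like this rough sketch, assuming the
AM has already triggered a UGI login; `AMRMTokenIdentifier.KIND_NAME` is used to
identify the AM/RM token.

```java
// Copy the credentials which UGI loaded from HADOOP_TOKEN_FILE_LOCATION at AM launch.
Credentials credentials =
    new Credentials(UserGroupInformation.getCurrentUser().getCredentials());

// Strip the AM/RM token: it must never be passed to launched containers.
Iterator<Token<?>> iter = credentials.getAllTokens().iterator();
while (iter.hasNext()) {
  if (iter.next().getKind().equals(AMRMTokenIdentifier.KIND_NAME)) {
    iter.remove();
  }
}

// Serialize the remaining (possibly empty) credentials; the resulting buffer
// can be used in ContainerLaunchContext.setTokens() for every container launched.
DataOutputBuffer dob = new DataOutputBuffer();
credentials.writeTokenStorageToStream(dob);
ByteBuffer containerTokens = ByteBuffer.wrap(dob.getData(), 0, dob.getLength());
```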
### Token Renewal

Tokens *expire*: they have a limited lifespan. An application wishing to
use a token past this expiry date must *renew* the token before the token
expires.

Hadoop automatically sets up a delegation token renewal thread when needed,
the `DelegationTokenRenewer`.

It is the responsibility of the application to renew all tokens other
than the AMRM and timeline tokens.

Here are the different strategies:

1. Don't. Rely on the lifespan of the application being so short that token
renewal is not needed. For applications whose life can always be measured
in minutes or tens of minutes, this is a viable strategy.

1. Start a background thread/Executor to renew the tokens at a regular interval.
This is what most YARN applications do.

## Other Aspects of YARN Security

### AM/RM Token Refresh

The AM/RM token is renewed automatically; the RM pushes out a new token
to the AM within an `allocate` response. Consult the `AMRMClientImpl` class
to see the process. *Your AM code does not need to worry about this process.*

### Token Renewal on AM Restart

Even if an application is renewing tokens regularly, if an AM fails and is
restarted, it gets restarted from that original
`ApplicationSubmissionContext`. The tokens there may have expired, so localization
may fail, even before the issue of credentials to talk to other services.

How is this problem addressed? The YARN Resource Manager gets a new token
for the node managers, if needed.

More precisely:

1. The token passed by the RM to the NM for localization is refreshed/updated as needed.
1. Tokens in the app launch context for use by the application are *not* refreshed.
That is, if it has an out-of-date HDFS token, that token is not renewed. This
also holds for tokens for Hive, HBase, etc.
1. Therefore, to survive AM restart after token expiry, your AM has to get the
NMs to localize the keytab or make no HDFS accesses until (somehow) a new token has been passed to them from a client.

This is primarily an issue for long-lived services (see below).

### Unmanaged Application Masters

Unmanaged application masters are not launched in a container set up by
the RM and NM, so cannot automatically pick up an AM/RM token at launch time.
The `YarnClient.getAMRMToken()` API permits an Unmanaged AM to request an AM/RM
token. Consult `UnmanagedAMLauncher` for the specifics.

### Identity on an insecure cluster: `HADOOP_USER_NAME`

In an insecure cluster, the application will run as the identity of
the account of the node manager, typically something such as `yarn`
or `mapred`. By default, the application will access HDFS
as that user, with a different home directory, and with
a different user identity recorded in audit logs and on file system owner attributes.

This can be avoided by having the client specify the identity of the
HDFS/Hadoop user under which the application is expected to run. *This
does not affect the OS-level user or the application's access rights
to the local machine*.

When Kerberos is disabled, the identity of a user is picked up
by Hadoop first from the environment variable `HADOOP_USER_NAME`,
then from the OS-level username (e.g. the system property `user.name`).

YARN applications should propagate the user name of the user launching
an application by setting this environment variable.

```java
Map<String, String> env = new HashMap<>();
String userName = UserGroupInformation.getCurrentUser().getUserName();
env.put(UserGroupInformation.HADOOP_USER_NAME, userName);
containerLaunchContext.setEnvironment(env);
```

Note that this environment variable is picked up in all applications
which talk to HDFS via the hadoop libraries.
That is, if set, it
is the identity picked up by HBase and other applications executed
within the environment of a YARN container within which this environment
variable is set.

### Oozie integration and `HADOOP_TOKEN_FILE_LOCATION`

Apache Oozie can launch an application in a secure cluster by acquiring
all relevant credentials, saving them to a file in the local filesystem,
then setting the path to this file in the environment variable
`HADOOP_TOKEN_FILE_LOCATION`. This is of course the same environment variable
passed down by YARN in launched containers, and the content is the same: a byte
array with credentials.

Here, however, the environment variable is set in the environment
executing the YARN client. This client must use the token information saved
in the named file *instead of acquiring any tokens of its own*.

Loading in the token file is automatic: UGI does it during user login.

The client is then responsible for passing the same credentials into the
AM launch context. This can be done simply by passing down the current
credentials.

```java
credentials = new Credentials(
    UserGroupInformation.getCurrentUser().getCredentials());
```

### Timeline Server integration

The [Application Timeline Server](TimelineServer.html) can be deployed as a secure service
—in which case the application will need the relevant token to authenticate with
it. This process is handled automatically in `YarnClientImpl` if ATS is
enabled in a secure cluster. Similarly, the AM-side `TimelineClient` YARN service
class manages token renewal automatically via the ATS's SPNEGO-authenticated REST API.

If you need to prepare a set of delegation tokens for a YARN application launch
via Oozie, this can be done via the timeline client API.

```java
try (TimelineClient timelineClient = TimelineClient.createTimelineClient()) {
  timelineClient.init(conf);
  timelineClient.start();
  Token<TimelineDelegationTokenIdentifier> token =
      timelineClient.getDelegationToken(rmprincipal);
  credentials.addToken(token.getService(), token);
}
```

### Cancelling Tokens

Applications *may* wish to cancel tokens they hold when terminating their AM.
This ensures that the tokens are no longer valid.

This is not mandatory, and as a clean shutdown of a YARN application cannot
be guaranteed, it is not possible to guarantee that the tokens will always
be cancelled during application termination. However, it does reduce the window of
vulnerability to stolen tokens.

## Securing Long-lived YARN Services

There is a time limit on all token renewals, after which tokens won't renew,
causing the application to stop working. This is somewhere between seventy-two
hours and seven days.

Any YARN service intended to run for an extended period of time *must* have
a strategy for renewing credentials.

Here are the strategies:

### Pre-installed Keytabs for AM and containers

A keytab is provided for the application's use on every node.

This is done by:

1. Installing it in every cluster node's local filesystem.
1. Providing the path to this in a configuration option.
1. The application loading the credentials via
   `UserGroupInformation.loginUserFromKeytab()` (see the sketch below).

The keytab must be in a secure directory path, where
only the service (and other trusted accounts) can read it. Distribution
becomes a responsibility of the cluster operations team.

This is effectively how all static Hadoop applications get their security credentials.
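As a minimal sketch of this strategy, with illustrative configuration keys and
keytab path (these are not standard Hadoop properties):

```java
// Read the principal and the locally installed keytab path from the
// application's own configuration, then log in before any Hadoop calls are made.
Configuration conf = new Configuration();
String principal = conf.get("myapp.kerberos.principal");
String keytab = conf.get("myapp.kerberos.keytab",
    "/etc/security/keytabs/myapp.service.keytab");
UserGroupInformation.loginUserFromKeytab(principal, keytab);
```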
### Keytabs for AM and containers distributed via YARN

1. A keytab is uploaded to HDFS.

1. When launching the AM, the keytab is listed as a resource to localize to
the AM's container.

1. The Application Master is configured with the relative path to the keytab,
and logs in with `UserGroupInformation.loginUserFromKeytab()`.

1. When the AM launches a container, it lists the HDFS path to the keytab
as a resource to localize.

1. It adds the HDFS delegation token to the container launch context, so
that the keytab and other application files can be localized.

1. Launched containers must themselves log in via
   `UserGroupInformation.loginUserFromKeytab()`. UGI handles the login, and
   schedules a background thread to relogin the user periodically.

1. Token creation is handled automatically in the Hadoop IPC and REST APIs;
the containers stay logged in via Kerberos for their entire duration.

This avoids the administration task of installing keytabs for specific services
across the entire cluster.

It does require the client to have access to the keytab
and, as it is uploaded to the distributed filesystem, must be secured through
the appropriate path permissions/ACLs.

As all containers have access to the keytab, all code executing in the containers
has to be trusted. Malicious code (or code escaping some form of sandbox)
could read the keytab, and hence have access to the cluster until the keys
expire or are revoked.

This is the strategy implemented by Apache Slider (incubating).

### AM keytab distributed via YARN; AM regenerates delegation tokens for containers

1. A keytab is uploaded to HDFS by the client.

1. When launching the AM, the keytab is listed as a resource to localize to
the AM's container.

1. The Application Master is configured with the relative path to the keytab,
and logs in with `UserGroupInformation.loginUserFromKeytab()`. The UGI
codepath will still automatically load the file referenced by
`$HADOOP_TOKEN_FILE_LOCATION`, which is how the AMRM token is picked up.

1. When the AM launches a container, it acquires all the delegation tokens
needed by that container, and adds them to the container's container launch context.

1. Launched containers must load the delegation tokens from `$HADOOP_TOKEN_FILE_LOCATION`,
and use them (including renewals) until they can no longer be renewed.

1. The AM must implement an IPC interface which permits containers to request
a new set of delegation tokens; this interface must itself use authentication
and ideally wire encryption.

1. Before a delegation token is due to expire, the processes running in the containers
must request new tokens from the Application Master over the IPC channel.

1. When the containers need the new tokens, the AM, logged in with a keytab,
   asks the various cluster services for new tokens (sketched below).

(Note that refresh operations can also run in the other direction, pushed from the
AM to the containers, again over whatever IPC channel is implemented between
AM and containers. The rest of the algorithm is unchanged: AM-regenerated tokens
are passed to the containers over IPC.)

This is the strategy used by Apache Spark 1.5+, with a Netty-based protocol
between containers and the AM for token updates.

Because only the AM has direct access to the keytab, it is less exposed.
Code running in the containers only has access to the delegation tokens.

However, those containers will have access to HDFS from the tokens
passed in at container launch, so will have access to the copy of the keytab
used for launching the AM. While the AM could delete that keytab on launch,
doing so would stop YARN being able to successfully relaunch the AM after any
failure.
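A rough sketch of the AM-side regeneration step, run when a container asks for
fresh tokens over the application's own IPC channel. Here `conf` is the AM's
`Configuration`, and the choice of renewer principal is application-specific:

```java
// Make sure the keytab login is still fresh; this is a no-op if the TGT is valid.
UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();

// Collect new delegation tokens for the services the container talks to.
Credentials freshCredentials = new Credentials();
FileSystem.get(conf).addDelegationTokens(
    conf.get(YarnConfiguration.RM_PRINCIPAL), freshCredentials);
// ...add tokens for Hive, HBase, etc. here, using their own client libraries...

// Serialize the credentials and send them back over the AM <-> container channel.
// The container can rebuild them with Credentials.readTokenStorageStream() and
// add them to its current user via UserGroupInformation.addCredentials().
DataOutputBuffer dob = new DataOutputBuffer();
freshCredentials.writeTokenStorageToStream(dob);
byte[] tokenBytes = Arrays.copyOf(dob.getData(), dob.getLength());
```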
### Client-side Token Push

This strategy may be the sole one acceptable to a strict operations team: a client process
running on an account holding a Kerberos TGT negotiates with all needed cluster services
for new delegation tokens, tokens which are then pushed out to the Application Master via
some RPC interface.

This does require the client process to be re-executed on a regular basis; a cron or Oozie job
can do this. The AM will need to implement an IPC API over which renewed
tokens can be provided. (Note that as Oozie can collect the tokens itself,
all the updater application needs to do whenever executed is set up an IPC
connection with the AM and pass up the current user's credentials).

## Securing YARN Application Web UIs and REST APIs

YARN provides a straightforward way of giving every YARN application SPNEGO-authenticated
web pages: it implements SPNEGO authentication in the Resource Manager Proxy.

YARN application web UIs are expected to load the AM proxy filter when setting up
their web UI; this filter
will redirect all HTTP requests coming from any host other than the RM Proxy hosts to an
RM proxy, to which the client app/browser must re-issue the request. The client will authenticate
against the principal of the RM Proxy (usually `yarn`), and, once authenticated, have its
request forwarded.

As a result, all client interactions are SPNEGO-authenticated, without the YARN application
itself needing any Kerberos principal for the clients to authenticate against.

Known weaknesses in this approach are:

1. As calls coming from the proxy hosts are not redirected, any application running
on those hosts has unrestricted access to the YARN applications. This is why in a secure cluster
the proxy hosts *must* run on cluster nodes which do not run end user code (i.e. not run YARN
NodeManagers and hence schedule YARN containers, nor support logins by end users).

1. The HTTP requests between proxy and YARN RM Server are not currently encrypted.
That is: HTTPS is not supported.

## Securing YARN Application REST APIs

YARN REST APIs running on the same port as the registered web UI of a YARN application are
automatically authenticated via SPNEGO authentication in the RM proxy.

Any REST endpoint (and equally, any web UI) brought up on a different port does not
support SPNEGO authentication unless implemented in the YARN application itself.

## Checklist for YARN Applications

Here is the checklist of core actions which a YARN application must do
to successfully launch in a YARN cluster.

### Client

`[ ]` Client checks for security being enabled via `UserGroupInformation.isSecurityEnabled()`.

In a secure cluster:

`[ ]` If `HADOOP_TOKEN_FILE_LOCATION` is unset, client acquires delegation tokens
for the local filesystems, with the RM principal set as the renewer.

`[ ]` If `HADOOP_TOKEN_FILE_LOCATION` is unset, client acquires delegation tokens
for all other services to be used in the YARN application.

`[ ]` If `HADOOP_TOKEN_FILE_LOCATION` is set, client uses the current user's credentials
as the source of all tokens to be added to the container launch context.

`[ ]` Client sets all tokens on the AM's `ContainerLaunchContext` via `setTokens()`
(see the sketch below).

`[ ]` Recommended: if it is set in the client's environment,
client sets the environment variable `HADOOP_JAAS_DEBUG=true`
in the Container launch context of the AM.
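A minimal sketch of these secure-cluster client steps, ignoring the Oozie /
`HADOOP_TOKEN_FILE_LOCATION` case; `conf` is the client's configuration and
`amContainer` is the AM's `ContainerLaunchContext`:

```java
if (UserGroupInformation.isSecurityEnabled()) {
  Credentials credentials = new Credentials();
  // Name the RM principal as the renewer, so that YARN can renew the tokens
  // used for localization for as long as renewal is permitted.
  String renewer = conf.get(YarnConfiguration.RM_PRINCIPAL);
  FileSystem.get(conf).addDelegationTokens(renewer, credentials);
  // ...acquire tokens for any other services (Hive, HBase, timeline, ...) here...
  DataOutputBuffer dob = new DataOutputBuffer();
  credentials.writeTokenStorageToStream(dob);
  amContainer.setTokens(ByteBuffer.wrap(dob.getData(), 0, dob.getLength()));
}
```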
In an insecure cluster:

`[ ]` Propagate local username to YARN AM, hence HDFS identity, via the
`HADOOP_USER_NAME` environment variable.

### App Master

`[ ]` In a secure cluster, AM retrieves security tokens from the `HADOOP_TOKEN_FILE_LOCATION`
environment variable (automatically done by UGI).

`[ ]` A copy of the token set is filtered to remove the AM/RM token and any timeline
token.

`[ ]` A thread or executor is started to renew tokens on a regular basis.

`[ ]` Recommended: AM cancels tokens when application completes.

### Container Launch by AM

`[ ]` Tokens to be passed to containers are passed via
`ContainerLaunchContext.setTokens()`.

`[ ]` In an insecure cluster, propagate the `HADOOP_USER_NAME` environment variable.

`[ ]` Recommended: AM sets the environment variable `HADOOP_JAAS_DEBUG=true`
in the Container launch context if it is set in the AM's environment.

### Launched Containers

`[ ]` Call `UserGroupInformation.isSecurityEnabled()` to trigger security setup.

`[ ]` A thread or executor is started to renew tokens on a regular basis.

### YARN service

`[ ]` Application developers have chosen and implemented a token renewal strategy:
shared keytab, AM keytab or client-side token refresh.

`[ ]` In a secure cluster, the keytab is either already in HDFS (and checked for),
or it is in the local FS of the client, in which case it must be uploaded and added to
the list of resources to localize.

`[ ]` If stored in HDFS, keytab permissions should be checked. If the keytab
is readable by principals other than the current user, warn,
and consider actually failing the launch (much as the `ssh` client refuses to
use world-readable private keys).

`[ ]` Client acquires an HDFS delegation token and attaches it to the AM Container
Launch Context.

`[ ]` AM logs in as the principal in the keytab via `loginUserFromKeytab()`.

`[ ]` (AM extracts the AM/RM token from the `HADOOP_TOKEN_FILE_LOCATION` environment
variable).

`[ ]` For launched containers, either the keytab is propagated, or
the AM acquires/attaches all required delegation tokens to the Container Launch
context alongside the HDFS delegation token needed by the NMs.

## Testing YARN applications in a secure cluster

It is straightforward to be confident that a YARN application works in a secure
cluster. The process to do so is: test on a secure cluster.

Even a single-VM cluster can be set up with security enabled. If doing so,
we recommend turning security up to its strictest, with SPNEGO-authenticated
Web UIs (and hence RM Proxy), as well as IPC wire encryption. Setting the
Kerberos token expiry to under an hour will find Kerberos expiry problems
early, so is also recommended.

`[ ]` Application launched in secure cluster.

`[ ]` Launched application runs as user submitting job (tip: log the `user.name`
system property in the AM).

`[ ]` Web browser interaction verified in secure cluster.

`[ ]` REST client interaction (GET operations) tested.

`[ ]` Application continues to run after Kerberos token expiry.

`[ ]` Application does not launch if user lacks Kerberos credentials.

`[ ]` If the application supports the timeline server, verify that it publishes
events in a secure cluster.

`[ ]` If the application integrates with other applications, such as HBase or Hive,
verify that the interaction works in a secure cluster.
`[ ]` If the application communicates with remote HDFS clusters, verify
that it can do so in a secure cluster (i.e. that the client extracted any
delegation tokens for this at launch time).

## Important

*If you don't test your YARN application in a secure Hadoop cluster,
it won't work.*

And without those tests: *your users will be the ones to find out
that your application doesn't work in a secure cluster.*

Bear that in mind when considering how much development effort to put into
Kerberos support.