YARN-6807. Adding required missing configs to Federation configuration guide based on e2e testing. (Tanuj Nayak via Subru).

This commit is contained in:
Subru Krishnan 2017-07-13 18:44:32 -07:00 committed by Carlo Curino
parent 52daa6d971
commit b4ac9d1b63
1 changed files with 49 additions and 4 deletions

View File

@ -86,6 +86,8 @@ of the desirable properties of balance, optimal cluster utilization and global i
*NOTE*: In the current implementation the GPG is a manual tuning process, simply exposed via a CLI (YARN-3657).
This part of the federation system is part of future work in [YARN-5597](https://issues.apache.org/jira/browse/YARN-5597).
###Federation State-Store
The Federation State defines the additional state that needs to be maintained to loosely couple multiple individual sub-clusters into a single large federated cluster. This includes the following information:
@ -159,7 +161,7 @@ These are common configurations that should appear in the **conf/yarn-site.xml**
|:---- |:---- |
|`yarn.federation.enabled` | `true` | Whether federation is enabled or not |
|`yarn.federation.state-store.class` | `org.apache.hadoop.yarn.server.federation.store.impl.SQLFederationStateStore` | The type of state-store to use. |
|`yarn.federation.state-store.sql.url` | `jdbc:sqlserver://<host>:<port>;database` | For SQLFederationStateStore the name of the DB where the state is stored. |
|`yarn.federation.state-store.sql.url` | `jdbc:sqlserver://<host>:<port>;databaseName=FederationStateStore` | For SQLFederationStateStore the name of the DB where the state is stored. |
|`yarn.federation.state-store.sql.jdbc-class` | `com.microsoft.sqlserver.jdbc.SQLServerDataSource` | For SQLFederationStateStore the jdbc class to use. |
|`yarn.federation.state-store.sql.username` | `<dbuser>` | For SQLFederationStateStore the username for the DB connection. |
|`yarn.federation.state-store.sql.password` | `<dbpass>` | For SQLFederationStateStore the password for the DB connection. |
@ -175,7 +177,7 @@ Optional:
|`yarn.federation.policy-manager` | `org.apache.hadoop.yarn.server.federation.policies.manager.WeightedLocalityPolicyManager` | The choice of policy manager determines how Applications and ResourceRequests are routed through the system. |
|`yarn.federation.policy-manager-params` | `<binary>` | The payload that configures the policy. In our example a set of weights for router and amrmproxy policies. This is typically generated by serializing a policymanager that has been configured programmatically, or by populating the state-store with the .json serialized form of it. |
|`yarn.federation.subcluster-resolver.class` | `org.apache.hadoop.yarn.server.federation.resolver.DefaultSubClusterResolverImpl` | The class used to resolve which subcluster a node belongs to, and which subcluster(s) a rack belongs to. |
| `yarn.federation.machine-list` | `node1,subcluster1,rack1\n node2 , subcluster2, RACK1\n noDE3,subcluster3, rack2\n node4, subcluster3, rack2\n` | a list of Nodes, Sub-clusters, Rack, used by the `DefaultSubClusterResolverImpl` |
| `yarn.federation.machine-list` | `node1,subcluster1,rack1\n node2 , subcluster2, RACK1\n node3,subcluster3, rack2\n node4, subcluster3, rack2\n` | a list of Nodes, Sub-clusters, Rack, used by the `DefaultSubClusterResolverImpl` |
###ON RMs:
@ -200,6 +202,7 @@ These are extra configurations that should appear in the **conf/yarn-site.xml**
| Property | Example | Description |
|:---- |:---- |
|`yarn.router.bind-host` | `0.0.0.0` | Host IP to bind the router to. The actual address the server will bind to. If this optional address is set, the RPC and webapp servers will bind to this address and the port specified in yarn.router.*.address respectively. This is most useful for making Router listen to all interfaces by setting to 0.0.0.0. |
| `yarn.router.clientrm.interceptor-class.pipeline` | `org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor` | A comma-seperated list of interceptor classes to be run at the router when interfacing with the client. The last step of this pipeline must be the Federation Client Interceptor. |
Optional:
@ -222,11 +225,53 @@ These are extra configurations that should appear in the **conf/yarn-site.xml**
| Property | Example | Description |
|:---- |:---- |
|`yarn.nodemanager.amrmproxy.interceptor-class.pipeline` | `org.apache.hadoop.yarn.server.nodemanager.amrmproxy.FederationInterceptor` | A coma-separated list of interceptors to be run at the amrmproxy. For federation the last step in the pipeline should be the FederationInterceptor. |
| `yarn.nodemanager.amrmproxy.enabled` | `true` | Whether or not the AMRMProxy is enabled.
|`yarn.nodemanager.amrmproxy.interceptor-class.pipeline` | `org.apache.hadoop.yarn.server.nodemanager.amrmproxy.FederationInterceptor` | A comma-separated list of interceptors to be run at the amrmproxy. For federation the last step in the pipeline should be the FederationInterceptor.
| `yarn.client.failover.proxy-provider` | `org.apache.hadoop.yarn.server.federation.failover.FederationRMFailoverProxyProvider` | The class used to connect to the RMs by looking up the membership information in federation state-store. This must be set if federation is enabled, even if RM HA is not enabled.|
Optional:
| Property | Example | Description |
|:---- |:---- |
|`yarn.federation.statestore.max-connections` | `1` | The maximum number of parallel connections from each AMRMProxy to the state-store. This value is typically lower than the router one, since we have many AMRMProxy that could burn-through many DB connections quickly. |
|`yarn.federation.cache-ttl.secs` | `300` | The time to leave for the AMRMProxy cache. Typically larger than at the router, as the number of AMRMProxy is large, and we want to limit the load to the centralized state-store. |
|`yarn.federation.cache-ttl.secs` | `300` | The time to leave for the AMRMProxy cache. Typically larger than at the router, as the number of AMRMProxy is large, and we want to limit the load to the centralized state-store. |
###State-Store:
Currently, the only supported implementation of the state-store is Microsoft SQL Server. After [setting up](https://www.microsoft.com/en-us/sql-server/sql-server-downloads) such an instance of SQL Server, set up the database for use by the federation system. This can be done by running the following SQL files in the database: **sbin/FederationStateStore/SQLServer/FederationStateStoreStoreProcs.sql** and **sbin/FederationStateStore/SQLServer/FederationStateStoreStoreTables.sql**
Running a Sample Job
--------------------
In order to submit jobs to a Federation cluster one must create a seperate set of configs for the client from which said jobs will be submitted. In these, the **conf/yarn-site.xml** should have the following additional configurations:
| Property | Example | Description |
|:--- |:--- |
| `yarn.resourcemanager.address` | `<router_host>:8050` | Redirects jobs launched at the client to the router's client RM port. |
| `yarn.resourcemanger.scheduler.address` | `localhost:8049` | Redirects jobs to the federation AMRMProxy port.|
Any YARN jobs for the cluster can be submitted from the client configurations described above. In order to launch a job through federation, first start up all the clusters involved in the federation as described [here](../../hadoop-project-dist/hadoop-common/ClusterSetup.html). Next, start up the router on the router machine with the following command:
$HADOOP_HOME/bin/yarn --daemon start router
Now with $HADOOP_CONF_DIR pointing to the client configurations folder that is described above, run your job the usual way. The configurations in the client configurations folder described above will direct the job to the router's client RM port where the router should be listening after being started. Here is an example run of a Pi job on a federation cluster from the client:
$HADOOP_HOME/bin/yarn jar hadoop-mapreduce-examples-3.0.0.jar pi 16 1000
This job is submitted to the router which as described above, uses a generated policy from the [GPG](#Global_Policy_Generator) to pick a home RM for the job to which it is submitted.
The output from this particular example job should be something like:
2017-07-13 16:29:25,055 INFO mapreduce.Job: Job job_1499988226739_0001 running in uber mode : false
2017-07-13 16:29:25,056 INFO mapreduce.Job: map 0% reduce 0%
2017-07-13 16:29:33,131 INFO mapreduce.Job: map 38% reduce 0%
2017-07-13 16:29:39,176 INFO mapreduce.Job: map 75% reduce 0%
2017-07-13 16:29:45,217 INFO mapreduce.Job: map 94% reduce 0%
2017-07-13 16:29:46,228 INFO mapreduce.Job: map 100% reduce 100%
2017-07-13 16:29:46,235 INFO mapreduce.Job: Job job_1499988226739_0001 completed successfully
.
.
.
Job Finished in 30.586 seconds
Estimated value of Pi is 3.14250000......
Note that no change in the code or recompilation of the input jar was required to use federation. Also, the output of this job is the exact same as it would be when run without federation. Also, in order to get the full benefit of federation, use a large enough number of mappers such that more than one cluster is required. That number happens to be 16 in the case of the above example.