diff --git a/hadoop-common-project/hadoop-common/src/main/docs/releasenotes.html b/hadoop-common-project/hadoop-common/src/main/docs/releasenotes.html index b58c609db20..64ba63b4ba7 100644 --- a/hadoop-common-project/hadoop-common/src/main/docs/releasenotes.html +++ b/hadoop-common-project/hadoop-common/src/main/docs/releasenotes.html @@ -15,6 +15,622 @@ limitations under the License. --> +
The new AM fails to start after the RM restarts: the RM fails to launch a new ApplicationMaster, and the job fails with the error below.
+
+ /usr/bin/mapred job -status job_1380985373054_0001
+13/10/05 15:04:04 INFO client.RMProxy: Connecting to ResourceManager at hostname
+Job: job_1380985373054_0001
+Job File: /user/abc/.staging/job_1380985373054_0001/job.xml
+Job Tracking URL : http://hostname:8088/cluster/app/application_1380985373054_0001
+Uber job : false
+Number of maps: 0
+Number of reduces: 0
+map() completion: 0.0
+reduce() completion: 0.0
+Job state: FAILED
+retired: false
+reason for failure: There are no failed tasks for the job. Job is failed due to some other reason and reason can be found in the logs.
+Counters: 0
This is the YARN part of HADOOP-10022.
LCE container launch assumes that the usercache/USER directory exists and is owned by the user running the container process.
+
+However, that directory is created by the LCE localization command only if there are resources to localize. If there are no resources to localize, LCE localization never executes, the launch fails with exit code 255, and the NM logs have something like:
+
+{code}
+2013-10-04 14:07:56,425 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : command provided 1
+2013-10-04 14:07:56,425 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : user is llama
+2013-10-04 14:07:56,425 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Can't create directory llama in /yarn/nm/usercache/llama/appcache/application_1380853306301_0004/container_1380853306301_0004_01_000004 - Permission denied
+{code}
+
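The real fix belongs in the native container-executor, but as a rough illustration of the missing step, the launch path needs something like the following before it uses the per-user directory (a plain-Java sketch with made-up paths, not the actual LCE code):

{code}
import java.io.File;
import java.io.IOException;

// Illustrative sketch only: ensure the per-user local directory exists before
// launching, instead of relying on localization to have created it. The real
// LCE must also chown the directory to the container user, which is omitted here.
public class EnsureUserCacheDir {
  static File ensureUserDir(File nmLocalDir, String user) throws IOException {
    File userDir = new File(nmLocalDir, "usercache/" + user);
    if (!userDir.isDirectory() && !userDir.mkdirs()) {
      throw new IOException("Could not create " + userDir);
    }
    return userDir;
  }

  public static void main(String[] args) throws IOException {
    // hypothetical NM local dir and user, mirroring the log above
    System.out.println(ensureUserDir(new File("/tmp/yarn-nm-local"), "llama"));
  }
}
{code}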
2013-10-04 22:09:15,234 ERROR [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #1] distributedshell.ApplicationMaster (ApplicationMaster.java:onStartContainerError(719)) - Failed to start Container container_1380920347574_0018_01_000006
The error is shown below in the comments. + +MAPREDUCE-2374 fixed this by removing "-c" when running the container launch script. It looks like the "-c" got brought back during the windows branch merge, so we should remove it again.
TestApplicationCleanup submits container requests and waits for allocations to come in. It only sends a single node heartbeat to the node, expecting multiple containers to be assigned on this heartbeat, which not all schedulers do by default. + +This is causing the test to fail when run with the Fair Scheduler.
This issue happens in a multi-node cluster where the resource manager and node manager run on different machines.
+
+Steps to reproduce:
+1) Set yarn.resourcemanager.hostname = <resourcemanager host> in yarn-site.xml
+2) Set hadoop.ssl.enabled = true in core-site.xml
+3) Do not specify the following properties in yarn-site.xml:
+yarn.nodemanager.webapp.https.address and yarn.resourcemanager.webapp.https.address
+In this case, the default values of these two properties are used.
+4) Go to the nodemanager web UI "https://<nodemanager host>:8044/node"
+5) Click on the RM_HOME link
+This link redirects to "https://<nodemanager host>:8090/cluster" instead of "https://<resourcemanager host>:8090/cluster"
+
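A hedged sketch of the intended behavior: the RM_HOME link should be built from the RM's configured web address (falling back to yarn.resourcemanager.hostname), not from the NM's own host. The property names come from this report; the lookup code below is illustrative, not the actual webapp code.

{code}
import org.apache.hadoop.conf.Configuration;

// Illustrative only: derive the RM_HOME target from RM-side configuration.
public class RmHomeLink {
  static String rmHomeUrl(Configuration conf) {
    String rmWebApp = conf.get("yarn.resourcemanager.webapp.https.address",
        conf.get("yarn.resourcemanager.hostname", "0.0.0.0") + ":8090");
    return "https://" + rmWebApp + "/cluster";
  }
}
{code}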
A container can set token service metadata for a service, say shuffle_service. If that service does not exist, the error is silently ignored. Later, when the next container wants to access data written to shuffle_service by the first task, it fails because the service does not have the token that was supposed to be set by the first task.
Before launching the container, the NM reuses the same credentials object and so pollutes what the container should see. We should fix this.
TestDistributedShell#testDSShell has been failing consistently on trunk Jenkins recently.
+The stack trace is:
+{code}
+java.lang.Exception: test timed out after 90000 milliseconds
+ at com.google.protobuf.LiteralByteString.<init>(LiteralByteString.java:234)
+ at com.google.protobuf.ByteString.copyFromUtf8(ByteString.java:255)
+ at org.apache.hadoop.ipc.protobuf.ProtobufRpcEngineProtos$RequestHeaderProto.getMethodNameBytes(ProtobufRpcEngineProtos.java:286)
+ at org.apache.hadoop.ipc.protobuf.ProtobufRpcEngineProtos$RequestHeaderProto.getSerializedSize(ProtobufRpcEngineProtos.java:462)
+ at com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:84)
+ at org.apache.hadoop.ipc.ProtobufRpcEngine$RpcMessageWithHeader.write(ProtobufRpcEngine.java:302)
+ at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:989)
+ at org.apache.hadoop.ipc.Client.call(Client.java:1377)
+ at org.apache.hadoop.ipc.Client.call(Client.java:1357)
+ at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
+ at $Proxy70.getApplicationReport(Unknown Source)
+ at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:137)
+ at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source)
+ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
+ at java.lang.reflect.Method.invoke(Method.java:597)
+ at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:185)
+ at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
+ at $Proxy71.getApplicationReport(Unknown Source)
+ at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:195)
+ at org.apache.hadoop.yarn.applications.distributedshell.Client.monitorApplication(Client.java:622)
+ at org.apache.hadoop.yarn.applications.distributedshell.Client.run(Client.java:597)
+ at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:125)
+{code}
+For details, please refer to:
+https://builds.apache.org/job/PreCommit-YARN-Build/2039//testReport/
If run under the super-user account, test-container-executor.c fails in multiple places. It would be nice to fix this so that we get better test coverage of LCE functionality.
Since there is no YARN history server, it becomes difficult to determine what the status of an old application is. One has to be familiar with the YARN state transitions to know what counts as a success.
+
+We should add a log at info level that captures what the finalStatus of an app is. This would be helpful while debugging applications if the RM has restarted and we can no longer use the UI.
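For example, something along these lines in the RM when an attempt completes (a sketch only; the variable names and message wording are made up, not the committed change):

{code}
// Illustrative fragment: record the outcome so it survives in the RM log
// even after a restart wipes the web UI state.
LOG.info("Application " + applicationId + " unregistered: finalStatus="
    + finalApplicationStatus + ", state=" + rmAppState
    + ", diagnostics=" + diagnostics);
{code}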
The fair scheduler sometimes picks a different queue than the one an application was submitted to, such as when user-as-default-queue is turned on. It needs to update the queue name in the RMApp so that this choice will be reflected in the UI. + +This isn't working because the scheduler is looking up the RMApp by application attempt id instead of app id and failing to find it.
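The shape of the fix, sketched with illustrative names (this mirrors the description above, not necessarily the committed patch): derive the ApplicationId from the attempt and look the RMApp up by that, since the RM's app map is keyed by application id.

{code}
// was (roughly): rmContext.getRMApps().get(applicationAttemptId)  -> never finds the app
ApplicationId appId = applicationAttemptId.getApplicationId();
RMApp rmApp = rmContext.getRMApps().get(appId);
if (rmApp != null) {
  // record the queue the scheduler actually placed the app in,
  // so the UI reflects e.g. the user-as-default-queue choice
}
{code}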
I ran a sleep job. If the AM fails to start, this exception can occur:
+
+13/09/20 11:00:23 INFO mapreduce.Job: Job job_1379673267098_0020 failed with state FAILED due to: Application application_1379673267098_0020 failed 1 times due to AM Container for appattempt_1379673267098_0020_000001 exited with exitCode: 1 due to: Exception from container-launch:
+org.apache.hadoop.util.Shell$ExitCodeException: /myappcache/application_1379673267098_0020/container_1379673267098_0020_01_000001/launch_container.sh: line 12: export: `NM_AUX_SERVICE_mapreduce.shuffle=AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
+': not a valid identifier
+
+at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
+at org.apache.hadoop.util.Shell.run(Shell.java:379)
+at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
+at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
+at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:270)
+at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:78)
+at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
+at java.util.concurrent.FutureTask.run(FutureTask.java:138)
+at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
+at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
+at java.lang.Thread.run(Thread.java:662)
+.Failing this attempt.. Failing the application.
Currently the Fair Scheduler is configured in two ways:
+* With an allocations file that has a different format than the standard Hadoop configuration file, which makes it easier to specify hierarchical objects like queues and their properties.
+* With properties like yarn.scheduler.fair.max.assign that are specified in the standard Hadoop configuration format.
+
+The standard and default way of configuring it is to use fair-scheduler.xml as the allocations file and to put the yarn.scheduler properties in yarn-site.xml.
+
+It is also possible to specify a different file as the allocations file, and to place the yarn.scheduler properties in fair-scheduler.xml, which will then be interpreted in the standard Hadoop configuration format. This flexibility is both confusing and unnecessary.
+
+Additionally, the allocation file is loaded as fair-scheduler.xml from the classpath if it is not specified, but is loaded as a File if it is. This causes two problems:
+1. We see different behavior when not setting yarn.scheduler.fair.allocation.file and when setting it to fair-scheduler.xml, which is its default.
+2. Classloaders may choose to cache resources, which can break the reload logic when yarn.scheduler.fair.allocation.file is not specified.
+
+We should never allow the yarn.scheduler properties to go into fair-scheduler.xml, and we should always load the allocations file as a file, not as a resource on the classpath. To preserve existing behavior and allow loading files from the classpath, we can look for files on the classpath, but strip off their scheme and interpret them as Files.
+
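A hedged sketch of the "look it up on the classpath, then treat it as a plain file" idea (the class and method names here are made up for illustration):

{code}
import java.io.File;
import java.net.URL;

// Illustrative only: resolve the configured name to a File, falling back to the
// classpath, so the reload logic always goes through File and never through a
// (possibly caching) ClassLoader.
public class AllocFileResolver {
  static File resolveAllocationFile(String configured) {
    File f = new File(configured);
    if (f.exists()) {
      return f;                                  // absolute or relative path given
    }
    URL url = Thread.currentThread().getContextClassLoader().getResource(configured);
    if (url != null && "file".equals(url.getProtocol())) {
      return new File(url.getPath());            // classpath hit: strip the scheme
    }
    return null;                                 // not found; caller decides what to do
  }

  public static void main(String[] args) {
    System.out.println(resolveAllocationFile("fair-scheduler.xml"));
  }
}
{code}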
While running a Hive join operation on YARN, I saw the exception described below. This is caused by FSDownload copying the file into a temp file and changing the suffix to ".tmp" before unpacking it. In unpack(), it uses FileUtil.unTar(), which determines whether the file is gzipped by looking at the file suffix:
+{code}
+boolean gzipped = inFile.toString().endsWith("gz");
+{code}
+
+To fix this problem, we can remove the ".tmp" from the temp file name.
+
+Here is the detailed exception:
+
+org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:240)
+ at org.apache.hadoop.fs.FileUtil.unTarUsingJava(FileUtil.java:676)
+ at org.apache.hadoop.fs.FileUtil.unTar(FileUtil.java:625)
+ at org.apache.hadoop.yarn.util.FSDownload.unpack(FSDownload.java:203)
+ at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:287)
+ at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:50)
+ at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
+ at java.util.concurrent.FutureTask.run(FutureTask.java:166)
+ at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
+ at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
+ at java.util.concurrent.FutureTask.run(FutureTask.java:166)
+ at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
+ at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
+
+at java.lang.Thread.run(Thread.java:722)
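To illustrate why the ".tmp" suffix breaks unpacking (the file names below are hypothetical; only the suffix check matters):

{code}
// Illustrative only: appending ".tmp" defeats the suffix check FileUtil.unTar() relies on.
public class SuffixCheckDemo {
  public static void main(String[] args) {
    String original = "archive.tar.gz";
    String localized = original + ".tmp";          // temp name used during download
    System.out.println(original.endsWith("gz"));   // true  -> unpacked as gzip
    System.out.println(localized.endsWith("gz"));  // false -> treated as a plain tar, fails
  }
}
{code}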
In the {{org.apache.hadoop.yarn.api.records.URL}} class, we don't have a userinfo field as part of the URL. When converting a {{java.net.URI}} object into the YARN URL object in the {{ConverterUtils.getYarnUrlFromURI()}} method, we set the URI host as the URL host. If the URI has a userinfo part, it is discarded. This leads to information loss, e.g. foo://username:password@example.com is converted to foo://example.com, and the username/password information is lost during the conversion.
+
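To make the loss concrete (the URI below is hypothetical; this only demonstrates which java.net.URI accessors carry the userinfo):

{code}
import java.net.URI;

// Illustrative only: copying just the host drops the userinfo; the authority keeps it.
public class UserInfoDemo {
  public static void main(String[] args) {
    URI uri = URI.create("foo://username:password@example.com/path");
    System.out.println(uri.getHost());      // example.com
    System.out.println(uri.getUserInfo());  // username:password -- lost if only the host is copied
    System.out.println(uri.getAuthority()); // username:password@example.com -- preserves userinfo
  }
}
{code}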
Currently, the app attempt ClientToken master key is registered before it is saved. This can cause a problem: if the client gets the token and the RM crashes before the master key is saved, the RM cannot reload the master key after it restarts, because it was never saved. As a result, the client is left holding an invalid token.
+
+We can instead register the client token master key after it has been saved in the store.
There is no YARN property available to configure the HTTPS port for the ResourceManager, NodeManager, and history server. Currently, YARN services use the ports defined for HTTP [by 'mapreduce.jobhistory.webapp.address', 'yarn.nodemanager.webapp.address', 'yarn.resourcemanager.webapp.address'] when running over HTTPS.
+
+YARN should have a set of properties to assign HTTPS ports for the RM, NM, and JHS, for example:
+yarn.nodemanager.webapp.https.address
+yarn.resourcemanager.webapp.https.address
+mapreduce.jobhistory.webapp.https.address
Need to add support to disable 'hadoop.ssl.enabled' for MR jobs.
+
+A job should be able to run over HTTP by setting the 'hadoop.ssl.enabled' property at the job level.
+
Submit a distributed shell application. Once the application reaches the RUNNING state, the app master host should not be empty. In reality, it is empty.
+
+==console logs==
+distributedshell.Client: Got application report from ASM for, appId=12, clientToAMToken=null, appDiagnostics=, appMasterHost=, appQueue=default, appMasterRpcPort=0, appStartTime=1378505161360, yarnAppState=RUNNING, distributedFinalState=UNDEFINED,
+
Submit a YARN distributed shell application and go to the ResourceManager web UI. The application appears, and the Tracking UI column contains a History link. Clicking that link returns HTTP error 500 instead of showing the application master web UI.
When the nodemanager receives a kill signal after an application has finished execution but before log aggregation has kicked in, an InvalidStateTransitonException (Invalid event: APPLICATION_LOG_HANDLING_FINISHED at RUNNING) is thrown:
+
+{noformat}
+2013-08-25 20:45:00,875 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:finishLogAggregation(254)) - Application just finished : application_1377459190746_0118
+2013-08-25 20:45:00,876 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:uploadLogsForContainer(105)) - Starting aggregate log-file for app application_1377459190746_0118 at /app-logs/foo/logs/application_1377459190746_0118/<host>_45454.tmp
+2013-08-25 20:45:00,876 INFO logaggregation.LogAggregationService (LogAggregationService.java:stopAggregators(151)) - Waiting for aggregation to complete for application_1377459190746_0118
+2013-08-25 20:45:00,891 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:uploadLogsForContainer(122)) - Uploading logs for container container_1377459190746_0118_01_000004. Current good log dirs are /tmp/yarn/local
+2013-08-25 20:45:00,915 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:doAppLogAggregation(182)) - Finished aggregate log-file for app application_1377459190746_0118
+2013-08-25 20:45:00,925 WARN application.Application (ApplicationImpl.java:handle(427)) - Can't handle this event at current state
+org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: APPLICATION_LOG_HANDLING_FINISHED at RUNNING
+ at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
+ at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
+ at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
+ at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:425)
+ at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:59)
+ at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:697)
+ at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:689)
+ at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:134)
+ at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:81)
+ at java.lang.Thread.run(Thread.java:662)
+2013-08-25 20:45:00,926 INFO application.Application (ApplicationImpl.java:handle(430)) - Application application_1377459190746_0118 transitioned from RUNNING to null
+2013-08-25 20:45:00,927 WARN monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(463)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
+2013-08-25 20:45:00,938 INFO ipc.Server (Server.java:stop(2437)) - Stopping server on 8040
+{noformat}
+
+
Currently, in CapacityScheduler and FifoScheduler, blacklist is updated together with resource requests, only when the incoming resource requests are not empty. Therefore, when the incoming resource requests are empty, the blacklist will not be updated even when blacklist additions and removals are not empty.
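A hedged sketch of the needed decoupling in the scheduler's allocate path (the method names are illustrative, not necessarily the exact APIs): apply blacklist changes regardless of whether the ask is empty.

{code}
// Illustrative only: update the blacklist unconditionally, then handle asks.
if (!blacklistAdditions.isEmpty() || !blacklistRemovals.isEmpty()) {
  application.updateBlacklist(blacklistAdditions, blacklistRemovals);
}
if (!ask.isEmpty()) {
  // existing resource-request handling stays as it is
  application.updateResourceRequests(ask);
}
{code}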
When log aggregation is enabled, if a user submits a MapReduce job and runs $ yarn logs -applicationId <app ID> while the YARN application is still running, the command prints no message and returns the user to the shell. It would be nice to tell the user that log aggregation is in progress.
+
+{code}
+-bash-4.1$ /usr/bin/yarn logs -applicationId application_1377900193583_0002
+-bash-4.1$
+{code}
+
+At the same time, if an invalid application ID is given, the YARN CLI should say that the application ID is incorrect rather than throwing a NoSuchElementException.
+{code}
+$ /usr/bin/yarn logs -applicationId application_00000
+Exception in thread "main" java.util.NoSuchElementException
+at com.google.common.base.AbstractIterator.next(AbstractIterator.java:75)
+at org.apache.hadoop.yarn.util.ConverterUtils.toApplicationId(ConverterUtils.java:124)
+at org.apache.hadoop.yarn.util.ConverterUtils.toApplicationId(ConverterUtils.java:119)
+at org.apache.hadoop.yarn.logaggregation.LogDumper.run(LogDumper.java:110)
+at org.apache.hadoop.yarn.logaggregation.LogDumper.main(LogDumper.java:255)
+
+{code}
+
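A minimal sketch of the kind of up-front validation the CLI could do before calling ConverterUtils.toApplicationId(), so the user gets a clear message instead of a NoSuchElementException (the class and method names here are made up):

{code}
// Illustrative only: check the application_<clusterTimestamp>_<sequenceNumber> shape first.
public class AppIdCheck {
  static boolean looksLikeAppId(String s) {
    String[] parts = s.split("_");
    if (parts.length != 3 || !"application".equals(parts[0])) {
      return false;
    }
    try {
      Long.parseLong(parts[1]);
      Integer.parseInt(parts[2]);
      return true;
    } catch (NumberFormatException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(looksLikeAppId("application_1377900193583_0002")); // true
    System.out.println(looksLikeAppId("application_00000"));              // false -> print a friendly error
  }
}
{code}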
FifoPolicy gives all of a queue's share to the earliest-scheduled application. + +{code} + Schedulable earliest = null; + for (Schedulable schedulable : schedulables) { + if (earliest == null || + schedulable.getStartTime() < earliest.getStartTime()) { + earliest = schedulable; + } + } + earliest.setFairShare(Resources.clone(totalResources)); +{code} + +If the queue has no schedulables in it, earliest will be left null, leading to an NPE on the last line.
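A minimal guard of the kind that avoids the NPE, reusing the names from the snippet above (illustrative, not necessarily the committed patch):

{code}
Schedulable earliest = null;
for (Schedulable schedulable : schedulables) {
  if (earliest == null ||
      schedulable.getStartTime() < earliest.getStartTime()) {
    earliest = schedulable;
  }
}
if (earliest != null) {                                       // guard against an empty queue
  earliest.setFairShare(Resources.clone(totalResources));
}
{code}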
When there is no resource available to run a job, the next job should go into the pending state. The RM UI should show the next job as a pending app, and the pending-app counter should be incremented.
+
+But currently the next job stays in the ACCEPTED state, no AM is assigned to it, and the pending app count is not incremented.
+Running 'job status <nextjob>' shows job state=PREP.
+
+$ mapred job -status job_1377122233385_0002
+13/08/21 21:59:23 INFO client.RMProxy: Connecting to ResourceManager at host1/ip1
+
+Job: job_1377122233385_0002
+Job File: /ABC/.staging/job_1377122233385_0002/job.xml
+Job Tracking URL : http://host1:port1/application_1377122233385_0002/
+Uber job : false
+Number of maps: 0
+Number of reduces: 0
+map() completion: 0.0
+reduce() completion: 0.0
+Job state: PREP
+retired: false
+reason for failure:
We found a case where our rack resolve script was not returning a rack due to a problem resolving the host address. This surfaced as an NPE in RackResolver.java, ultimately caught in RMContainerAllocator.
+
+{noformat}
+2013-08-01 07:11:37,708 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM.
+java.lang.NullPointerException
+ at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:99)
+ at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:92)
+ at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assignMapsWithLocality(RMContainerAllocator.java:1039)
+ at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assignContainers(RMContainerAllocator.java:925)
+ at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assign(RMContainerAllocator.java:861)
+ at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.access$400(RMContainerAllocator.java:681)
+ at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:219)
+ at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:243)
+ at java.lang.Thread.run(Thread.java:722)
+
+{noformat}
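A hedged sketch of a defensive fix, based on the general shape of RackResolver.coreResolve() (the exact committed change may differ): fall back to the default rack when the topology mapping returns nothing.

{code}
// Illustrative fragment only: guard against a missing result from the topology script.
List<String> rNameList = dnsToSwitchMapping.resolve(Collections.singletonList(hostName));
String rName;
if (rNameList == null || rNameList.isEmpty() || rNameList.get(0) == null) {
  rName = NetworkTopology.DEFAULT_RACK;   // "/default-rack"
} else {
  rName = rNameList.get(0);
}
return new NodeBase(hostName, rName);
{code}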
The Capacity Scheduler documents the yarn.scheduler.capacity.root.<queue-path>.acl_administer_queue config option for controlling who can administer a queue, but it is not hooked up to anything. The Fair Scheduler could make use of a similar option as well. This is a feature-parity regression from MR1.
+
+From yarn-site.xml, I see the following values:
+<property>
+<name>yarn.nodemanager.resource.memory-mb</name>
+<value>4192</value>
+</property>
+<property>
+<name>yarn.scheduler.maximum-allocation-mb</name>
+<value>4192</value>
+</property>
+<property>
+<name>yarn.scheduler.minimum-allocation-mb</name>
+<value>1024</value>
+</property>
+
+However, the ResourceManager UI shows the total memory as 5MB.
+
When an unhealthy node restarts, its resources may be added twice in the scheduler.
+The first time is at the node's reconnection, while the node's final state is still "UNHEALTHY".
+The second time is at the node's update, when its state changes from "UNHEALTHY" to "HEALTHY".
On a secure YARN setup, before the first job is executed, going to the web interface of the resource manager triggers authentication errors.
During review of symbolic links, many issues were found related to the impact on the semantics of existing APIs such as FileSystem#listStatus, FileSystem#globStatus, etc. There were also many issues brought up about symbolic links and their impact on the security and functionality of HDFS. All these issues will be addressed in the upcoming 2.3 release. Until then, the feature is temporarily disabled.