diff --git a/hadoop-common-project/hadoop-common/src/main/docs/releasenotes.html b/hadoop-common-project/hadoop-common/src/main/docs/releasenotes.html index 6af2f47e3e6..0c0c2e790b0 100644 --- a/hadoop-common-project/hadoop-common/src/main/docs/releasenotes.html +++ b/hadoop-common-project/hadoop-common/src/main/docs/releasenotes.html @@ -27,9 +27,9 @@ These release notes include new developer and user-facing incompatibilities, fea
  • YARN-357. Major bug reported by Daryn Sharp and fixed by Daryn Sharp (resourcemanager)
    App submission should not be synchronized
    -
    MAPREDUCE-2953 fixed a race condition with querying of app status by making {{RMClientService#submitApplication}} synchronously invoke {{RMAppManager#submitApplication}}. However, the {{synchronized}} keyword was also added to {{RMAppManager#submitApplication}} with the comment: -bq. I made the submitApplication synchronized to keep it consistent with the other routines in RMAppManager although I do not believe it needs it since the rmapp datastructure is already a concurrentMap and I don't see anything else that would be an issue. - +
    MAPREDUCE-2953 fixed a race condition with querying of app status by making {{RMClientService#submitApplication}} synchronously invoke {{RMAppManager#submitApplication}}. However, the {{synchronized}} keyword was also added to {{RMAppManager#submitApplication}} with the comment: +bq. I made the submitApplication synchronized to keep it consistent with the other routines in RMAppManager although I do not believe it needs it since the rmapp datastructure is already a concurrentMap and I don't see anything else that would be an issue. + It's been observed that app submission latency is being unnecessarily impacted.
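A minimal, hypothetical sketch (not the actual RMAppManager code) of the idea behind this fix: when the per-application structure is already a ConcurrentMap, method-level synchronization on submission is unnecessary, and submissions no longer serialize behind a single lock.
{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Toy sketch only: putIfAbsent gives atomic insert-or-reject, so no
// 'synchronized' is needed on the submit method itself.
class AppRegistry {
  private final ConcurrentMap<String, Object> apps = new ConcurrentHashMap<String, Object>();

  boolean submitApplication(String appId, Object app) {
    // returns true only for the first submission of appId
    return apps.putIfAbsent(appId, app) == null;
  }
}
{code}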
  • YARN-355. Blocker bug reported by Daryn Sharp and fixed by Daryn Sharp (resourcemanager)
    @@ -38,22 +38,22 @@ It's been observed that app submission latency is being unnecessarily impacted.<
  • YARN-354. Blocker bug reported by Liang Xie and fixed by Liang Xie
    WebAppProxyServer exits immediately after startup
    -
    Please see HDFS-4426 for detail, i found the yarn WebAppProxyServer is broken by HADOOP-9181 as well, here's the hot fix, and i verified manually in our test cluster. - +
Please see HDFS-4426 for details; I found the YARN WebAppProxyServer is broken by HADOOP-9181 as well. Here's the hot fix, which I verified manually in our test cluster. + I apologize for bringing about such trouble...
  • YARN-343. Major bug reported by Thomas Graves and fixed by Xuan Gong (capacityscheduler)
    Capacity Scheduler maximum-capacity value -1 is invalid
    -
    I tried to start the resource manager using the capacity scheduler with a particular queues maximum-capacity set to -1 which is supposed to disable it according to the docs but I got the following exception: - -java.lang.IllegalArgumentException: Illegal value of maximumCapacity -0.01 used in call to setMaxCapacity for queue foo - at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.checkMaxCapacity(CSQueueUtils.java:31) - at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.setupQueueConfigs(LeafQueue.java:220) - at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.<init>(LeafQueue.java:191) - at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:310) - at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:325) - at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:232) - at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:202) +
I tried to start the resource manager using the capacity scheduler with a particular queue's maximum-capacity set to -1, which is supposed to disable it according to the docs, but I got the following exception: + +java.lang.IllegalArgumentException: Illegal value of maximumCapacity -0.01 used in call to setMaxCapacity for queue foo + at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.checkMaxCapacity(CSQueueUtils.java:31) + at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.setupQueueConfigs(LeafQueue.java:220) + at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.<init>(LeafQueue.java:191) + at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:310) + at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:325) + at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:232) + at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:202)
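A hypothetical helper illustrating one reading of the documented behavior (the mapping of -1 to 100% is an assumption about the intended semantics, not the actual CSQueueUtils code): -1 means "undefined/disabled", so it should be resolved before the range check instead of reaching checkMaxCapacity() as -0.01.
{code}
// Sketch only: resolve the configured percentage before validation.
class MaxCapacityConfig {
  static final float UNDEFINED = -1.0f;

  static float resolveMaxCapacity(float configuredPercent) {
    // -1 disables the cap, which is equivalent to allowing up to 100%
    return (configuredPercent == UNDEFINED) ? 1.0f : configuredPercent / 100.0f;
  }
}
{code}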
  • YARN-336. Major bug reported by Sandy Ryza and fixed by Sandy Ryza (scheduler)
    @@ -62,57 +62,57 @@ java.lang.IllegalArgumentException: Illegal value of maximumCapacity -0.01 used
  • YARN-334. Critical bug reported by Thomas Graves and fixed by Thomas Graves
    Maven RAT plugin is not checking all source files
    -
    yarn side of HADOOP-9097 - - - -Running 'mvn apache-rat:check' passes, but running RAT by hand (by downloading the JAR) produces some warnings for Java files, amongst others. +
    yarn side of HADOOP-9097 + + + +Running 'mvn apache-rat:check' passes, but running RAT by hand (by downloading the JAR) produces some warnings for Java files, amongst others.
  • YARN-331. Major improvement reported by Sandy Ryza and fixed by Sandy Ryza (scheduler)
    Fill in missing fair scheduler documentation
    -
    In the fair scheduler documentation, a few config options are missing: -locality.threshold.node -locality.threshold.rack -max.assign -aclSubmitApps -minSharePreemptionTimeout +
    In the fair scheduler documentation, a few config options are missing: +locality.threshold.node +locality.threshold.rack +max.assign +aclSubmitApps +minSharePreemptionTimeout
  • YARN-330. Major bug reported by Hitesh Shah and fixed by Sandy Ryza (nodemanager)
    Flakey test: TestNodeManagerShutdown#testKillContainersOnShutdown
    -
    =Seems to be timing related as the container status RUNNING as returned by the ContainerManager does not really indicate that the container task has been launched. Sleep of 5 seconds is not reliable. - -Running org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown -Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 9.353 sec <<< FAILURE! -testKillContainersOnShutdown(org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown) Time elapsed: 9283 sec <<< FAILURE! -junit.framework.AssertionFailedError: Did not find sigterm message - at junit.framework.Assert.fail(Assert.java:47) - at junit.framework.Assert.assertTrue(Assert.java:20) - at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.testKillContainersOnShutdown(TestNodeManagerShutdown.java:162) - -Logs: - -2013-01-09 14:13:08,401 INFO [AsyncDispatcher event handler] container.Container (ContainerImpl.java:handle(835)) - Container container_0_0000_01_000000 transitioned from NEW to LOCALIZING -2013-01-09 14:13:08,412 INFO [AsyncDispatcher event handler] localizer.LocalizedResource (LocalizedResource.java:handle(194)) - Resource file:hadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown/tmpDir/scriptFile.sh transitioned from INIT to DOWNLOADING -2013-01-09 14:13:08,412 INFO [AsyncDispatcher event handler] localizer.ResourceLocalizationService (ResourceLocalizationService.java:handle(521)) - Created localizer for container_0_0000_01_000000 -2013-01-09 14:13:08,589 INFO [LocalizerRunner for container_0_0000_01_000000] localizer.ResourceLocalizationService (ResourceLocalizationService.java:writeCredentials(895)) - Writing credentials to the nmPrivate file hadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown/nm0/nmPrivate/container_0_0000_01_000000.tokens. Credentials list: -2013-01-09 14:13:08,628 INFO [LocalizerRunner for container_0_0000_01_000000] nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:createUserCacheDirs(373)) - Initializing user nobody -2013-01-09 14:13:08,709 INFO [main] containermanager.ContainerManagerImpl (ContainerManagerImpl.java:getContainerStatus(538)) - Returning container_id {, app_attempt_id {, application_id {, id: 0, cluster_timestamp: 0, }, attemptId: 1, }, }, state: C_RUNNING, diagnostics: "", exit_status: -1000, -2013-01-09 14:13:08,781 INFO [LocalizerRunner for container_0_0000_01_000000] nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:startLocalizer(99)) - Copying from hadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown/nm0/nmPrivate/container_0_0000_01_000000.tokens to hadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown/nm0/usercache/nobody/appcache/application_0_0000/container_0_0000_01_000000.tokens - - +
    =Seems to be timing related as the container status RUNNING as returned by the ContainerManager does not really indicate that the container task has been launched. Sleep of 5 seconds is not reliable. + +Running org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown +Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 9.353 sec <<< FAILURE! +testKillContainersOnShutdown(org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown) Time elapsed: 9283 sec <<< FAILURE! +junit.framework.AssertionFailedError: Did not find sigterm message + at junit.framework.Assert.fail(Assert.java:47) + at junit.framework.Assert.assertTrue(Assert.java:20) + at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.testKillContainersOnShutdown(TestNodeManagerShutdown.java:162) + +Logs: + +2013-01-09 14:13:08,401 INFO [AsyncDispatcher event handler] container.Container (ContainerImpl.java:handle(835)) - Container container_0_0000_01_000000 transitioned from NEW to LOCALIZING +2013-01-09 14:13:08,412 INFO [AsyncDispatcher event handler] localizer.LocalizedResource (LocalizedResource.java:handle(194)) - Resource file:hadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown/tmpDir/scriptFile.sh transitioned from INIT to DOWNLOADING +2013-01-09 14:13:08,412 INFO [AsyncDispatcher event handler] localizer.ResourceLocalizationService (ResourceLocalizationService.java:handle(521)) - Created localizer for container_0_0000_01_000000 +2013-01-09 14:13:08,589 INFO [LocalizerRunner for container_0_0000_01_000000] localizer.ResourceLocalizationService (ResourceLocalizationService.java:writeCredentials(895)) - Writing credentials to the nmPrivate file hadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown/nm0/nmPrivate/container_0_0000_01_000000.tokens. Credentials list: +2013-01-09 14:13:08,628 INFO [LocalizerRunner for container_0_0000_01_000000] nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:createUserCacheDirs(373)) - Initializing user nobody +2013-01-09 14:13:08,709 INFO [main] containermanager.ContainerManagerImpl (ContainerManagerImpl.java:getContainerStatus(538)) - Returning container_id {, app_attempt_id {, application_id {, id: 0, cluster_timestamp: 0, }, attemptId: 1, }, }, state: C_RUNNING, diagnostics: "", exit_status: -1000, +2013-01-09 14:13:08,781 INFO [LocalizerRunner for container_0_0000_01_000000] nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:startLocalizer(99)) - Copying from hadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown/nm0/nmPrivate/container_0_0000_01_000000.tokens to hadoop-common/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown/nm0/usercache/nobody/appcache/application_0_0000/container_0_0000_01_000000.tokens + +
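Since the flakiness comes from a fixed 5-second sleep, a hypothetical test helper (not the actual TestNodeManagerShutdown code) would poll for the real condition, e.g. "the container process has actually been launched", with a deadline:
{code}
// Sketch of a poll-with-timeout helper instead of an unreliable fixed sleep.
final class WaitUtil {
  interface Condition { boolean isMet(); }

  static boolean waitFor(Condition condition, long timeoutMs, long pollMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (condition.isMet()) {
        return true;
      }
      Thread.sleep(pollMs);
    }
    return condition.isMet();   // one last check at the deadline
  }
}
{code}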
  • YARN-328. Major improvement reported by Suresh Srinivas and fixed by Suresh Srinivas (resourcemanager)
    Use token request messages defined in hadoop common
    -
    YARN changes related to HADOOP-9192 to reuse the protobuf messages defined in common. +
    YARN changes related to HADOOP-9192 to reuse the protobuf messages defined in common.
  • YARN-325. Blocker bug reported by Jason Lowe and fixed by Arun C Murthy (capacityscheduler)
    RM CapacityScheduler can deadlock when getQueueInfo() is called and a container is completing
    -
    If a client calls getQueueInfo on a parent queue (e.g.: the root queue) and containers are completing then the RM can deadlock. getQueueInfo() locks the ParentQueue and then calls the child queues' getQueueInfo() methods in turn. However when a container completes, it locks the LeafQueue then calls back into the ParentQueue. When the two mix, it's a recipe for deadlock. - +
    If a client calls getQueueInfo on a parent queue (e.g.: the root queue) and containers are completing then the RM can deadlock. getQueueInfo() locks the ParentQueue and then calls the child queues' getQueueInfo() methods in turn. However when a container completes, it locks the LeafQueue then calls back into the ParentQueue. When the two mix, it's a recipe for deadlock. + Stacktrace to follow.
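A toy illustration of the inconsistent lock ordering described above (these are stand-in classes, not the scheduler code): one thread holds the parent lock and wants the child lock, while another holds the child lock and wants the parent lock.
{code}
// Thread A: parent.getQueueInfo(child)        -> holds parent, wants child
// Thread B: child.completedContainer(parent)  -> holds child, wants parent
class ParentQueueToy {
  synchronized void getQueueInfo(ChildQueueToy child) {
    child.getQueueInfo();                 // parent lock held, child lock wanted
  }
}

class ChildQueueToy {
  synchronized void getQueueInfo() { }

  synchronized void completedContainer(ParentQueueToy parent) {
    parent.getQueueInfo(this);            // child lock held, parent lock wanted
  }
}
{code}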
  • YARN-320. Blocker bug reported by Daryn Sharp and fixed by Daryn Sharp (resourcemanager)
    @@ -121,7 +121,7 @@ Stacktrace to follow.
  • YARN-319. Major bug reported by shenhong and fixed by shenhong (resourcemanager , scheduler)
Submit a job to a queue that is not allowed in fairScheduler, client will hold forever.
    -
    RM use fairScheduler, when client submit a job to a queue, but the queue do not allow the user to submit job it, in this case, client will hold forever. +
The RM uses the fair scheduler; when a client submits a job to a queue that does not allow the user to submit jobs to it, the client will hang forever.
  • YARN-315. Major improvement reported by Suresh Srinivas and fixed by Suresh Srinivas
    @@ -134,20 +134,20 @@ Stacktrace to follow.
  • YARN-301. Major bug reported by shenhong and fixed by shenhong (resourcemanager , scheduler)
    Fair scheduler throws ConcurrentModificationException when iterating over app's priorities
    -
    In my test cluster, fairscheduler appear to concurrentModificationException and RM crash, here is the message: - -2012-12-30 17:14:17,171 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler -java.util.ConcurrentModificationException - at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1100) - at java.util.TreeMap$KeyIterator.next(TreeMap.java:1154) - at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.assignContainer(AppSchedulable.java:297) - at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:181) - at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:780) - at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:842) - at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:98) - at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:340) - at java.lang.Thread.run(Thread.java:662) - +
In my test cluster, the fair scheduler hit a ConcurrentModificationException and the RM crashed; here is the message: + +2012-12-30 17:14:17,171 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler +java.util.ConcurrentModificationException + at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1100) + at java.util.TreeMap$KeyIterator.next(TreeMap.java:1154) + at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.assignContainer(AppSchedulable.java:297) + at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:181) + at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:780) + at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:842) + at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:98) + at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:340) + at java.lang.Thread.run(Thread.java:662) +
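The trace shows a TreeMap key iterator failing while another thread mutates the priorities. A hypothetical sketch of one common way to avoid this failure mode (the actual fix may differ): iterate over a snapshot of the priorities taken under a lock, so a concurrent add/remove cannot invalidate the iterator mid-loop.
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;

// Sketch: copy the key set under a lock, then iterate the copy.
class PrioritySnapshot {
  static <P> List<P> snapshot(SortedMap<P, ?> priorityToRequests) {
    synchronized (priorityToRequests) {   // all mutators must use the same lock
      return new ArrayList<P>(priorityToRequests.keySet());
    }
  }
}
{code}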
  • YARN-300. Major bug reported by shenhong and fixed by Sandy Ryza (resourcemanager , scheduler)
    @@ -160,8 +160,8 @@ java.util.ConcurrentModificationException
  • YARN-288. Major bug reported by Sandy Ryza and fixed by Sandy Ryza (resourcemanager , scheduler)
    Fair scheduler queue doesn't accept any jobs when ACLs are configured.
    -
    If a queue is configured with an ACL for who can submit jobs, no jobs are allowed, even if a user on the list tries. - +
If a queue is configured with an ACL for who can submit jobs, no jobs are allowed, even if a user on the list tries. + This is caused by the scheduler thinking the user is "yarn", because it calls UserGroupInformation.getCurrentUser() instead of UserGroupInformation.createRemoteUser() with the given user name.
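A simplified sketch of the distinction described above (the helper class and its placement are illustrative, not the scheduler's actual code): getCurrentUser() yields the RM daemon's own identity ("yarn"), while createRemoteUser(userName) represents the submitting user, which is what the queue ACL must be checked against.
{code}
import java.io.IOException;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.AccessControlList;

class SubmitAclCheck {
  static boolean canSubmit(AccessControlList submitAcl, String userName)
      throws IOException {
    UserGroupInformation daemon = UserGroupInformation.getCurrentUser();     // the daemon, i.e. "yarn"
    UserGroupInformation submitter = UserGroupInformation.createRemoteUser(userName);
    return submitAcl.isUserAllowed(submitter);   // check the real submitter, not the daemon
  }
}
{code}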
  • YARN-286. Major new feature reported by Tom White and fixed by Tom White (applications)
    @@ -170,12 +170,12 @@ This is caused by using the scheduler thinking the user is "yarn", because it ca
  • YARN-285. Major improvement reported by Derek Dagit and fixed by Derek Dagit
    RM should be able to provide a tracking link for apps that have already been purged
    -
    As applications complete, the RM tracks their IDs in a completed list. This list is routinely truncated to limit the total number of application remembered by the RM. - -When a user clicks the History for a job, either the browser is redirected to the application's tracking link obtained from the stored application instance. But when the application has been purged from the RM, an error is displayed. - -In very busy clusters the rate at which applications complete can cause applications to be purged from the RM's internal list within hours, which breaks the proxy URLs users have saved for their jobs. - +
As applications complete, the RM tracks their IDs in a completed list. This list is routinely truncated to limit the total number of applications remembered by the RM. + +When a user clicks the History link for a job, the browser is redirected to the application's tracking link obtained from the stored application instance. But when the application has been purged from the RM, an error is displayed instead. + +In very busy clusters the rate at which applications complete can cause applications to be purged from the RM's internal list within hours, which breaks the proxy URLs users have saved for their jobs. + We would like the RM to provide tracking links that persist so that users are not frustrated by broken links.
  • YARN-283. Major bug reported by Sandy Ryza and fixed by Sandy Ryza (scheduler)
    @@ -200,8 +200,8 @@ We would like the RM to provide valid tracking links persist so that users are n
  • YARN-272. Major bug reported by Sandy Ryza and fixed by Sandy Ryza (scheduler)
    Fair scheduler log messages try to print objects without overridden toString methods
    -
    A lot of junk gets printed out like this: - +
    A lot of junk gets printed out like this: + 2012-12-11 17:31:52,998 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: Application application_1355270529654_0003 reserved container org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl@324f0f97 on node host: c1416.hal.cloudera.com:46356 #containers=7 available=0 used=8192, currently has 4 at priority org.apache.hadoop.yarn.api.records.impl.pb.PriorityPBImpl@33; currentReservation 4096
  • YARN-271. Major bug reported by Sandy Ryza and fixed by Sandy Ryza (resourcemanager , scheduler)
    @@ -218,13 +218,13 @@ We would like the RM to provide valid tracking links persist so that users are n
  • YARN-264. Major bug reported by Karthik Kambatla and fixed by Karthik Kambatla
    y.s.rm.DelegationTokenRenewer attempts to renew token even after removing an app
    -
    yarn.s.rm.security.DelegationTokenRenewer uses TimerTask/Timer. When such a timer task is canceled, already scheduled tasks run to completion. The task should check for such cancellation before running. Also, delegationTokens needs to be synchronized on all accesses. - +
    yarn.s.rm.security.DelegationTokenRenewer uses TimerTask/Timer. When such a timer task is canceled, already scheduled tasks run to completion. The task should check for such cancellation before running. Also, delegationTokens needs to be synchronized on all accesses. +
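A hypothetical sketch of the two points above (not the actual DelegationTokenRenewer code): the task re-checks its cancelled flag before doing any work, and the token collection is wrapped so that all accesses are synchronized.
{code}
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.TimerTask;

class RenewalTask extends TimerTask {
  private volatile boolean cancelled = false;
  private final Set<Object> delegationTokens =
      Collections.synchronizedSet(new HashSet<Object>());

  @Override
  public boolean cancel() {
    cancelled = true;                 // remember the cancel for runs already scheduled
    return super.cancel();
  }

  @Override
  public void run() {
    if (cancelled) {
      return;                         // the app was already removed: do nothing
    }
    // ... renew the tokens in delegationTokens ...
  }
}
{code}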
  • YARN-258. Major bug reported by Ravi Prakash and fixed by Ravi Prakash (resourcemanager)
    RM web page UI shows Invalid Date for start and finish times
    -
    Whenever the number of jobs was greater than a 100, two javascript arrays were being populated. appsData and appsTableData. appsData was winning out (because it was coming out later) and so renderHadoopDate was trying to render a <br title=""...> string. +
Whenever the number of jobs was greater than 100, two JavaScript arrays were being populated: appsData and appsTableData. appsData was winning out (because it was coming out later) and so renderHadoopDate was trying to render a <br title=""...> string.
  • YARN-254. Major improvement reported by Sandy Ryza and fixed by Sandy Ryza (resourcemanager , scheduler)
    @@ -241,7 +241,7 @@ We would like the RM to provide valid tracking links persist so that users are n
  • YARN-230. Major sub-task reported by Bikas Saha and fixed by Bikas Saha (resourcemanager)
    Make changes for RM restart phase 1
    -
    As described in YARN-128, phase 1 of RM restart puts in place mechanisms to save application state and read them back after restart. Upon restart, the NM's are asked to reboot and the previously running AM's are restarted. +
As described in YARN-128, phase 1 of RM restart puts in place mechanisms to save application state and read it back after restart. Upon restart, the NMs are asked to reboot and the previously running AMs are restarted. After this is done, RM HA and work-preserving restart can continue in parallel. For more details please refer to the design document in YARN-128.
  • YARN-229. Major sub-task reported by Bikas Saha and fixed by Bikas Saha (resourcemanager)
    @@ -250,34 +250,34 @@ After this is done, RM HA and work preserving restart can continue in parallel.
  • YARN-225. Critical bug reported by Devaraj K and fixed by Devaraj K (resourcemanager)
Proxy Link in RM UI throws NPE in Secure mode
    -
    {code:xml} -java.lang.NullPointerException - at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:241) - at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) - at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) - at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) - at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) - at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) - at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) - at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:975) - at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) - at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) - at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) - at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) - at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) - at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) - at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) - at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) - at org.mortbay.jetty.Server.handle(Server.java:326) - at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) - at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) - at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) - at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) - at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) - at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) - at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) - - +
    {code:xml} +java.lang.NullPointerException + at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:241) + at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) + at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) + at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) + at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) + at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) + at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) + at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:975) + at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) + at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) + at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) + at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) + at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) + at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) + at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) + at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) + at org.mortbay.jetty.Server.handle(Server.java:326) + at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) + at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) + at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) + at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) + at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) + at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) + at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) + + {code}
  • YARN-224. Major bug reported by Sandy Ryza and fixed by Sandy Ryza
    @@ -286,8 +286,8 @@ java.lang.NullPointerException
  • YARN-223. Critical bug reported by Radim Kolar and fixed by Radim Kolar
    Change processTree interface to work better with native code
    -
    Problem is that on every update of processTree new object is required. This is undesired when working with processTree implementation in native code. - +
The problem is that on every update of the process tree a new object is required. This is undesirable when working with a process tree implementation in native code. + Replace ProcessTree.getProcessTree() with updateProcessTree(). No new object allocation is needed and it simplifies application code a bit.
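A hypothetical before/after shape of that interface change (names and shape are illustrative, not the exact YARN interface):
{code}
interface ProcessTreeSketch {
  // old style: each refresh returns a brand-new tree object
  ProcessTreeSketch getProcessTree();

  // new style: refresh this instance in place; callers keep a single object,
  // which is friendlier to a native-code implementation
  void updateProcessTree();
}
{code}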
  • YARN-222. Major improvement reported by Sandy Ryza and fixed by Sandy Ryza (resourcemanager , scheduler)
    @@ -308,46 +308,46 @@ replace ProcessTree.getProcessTree() with updateProcessTree(). No new object all
  • YARN-214. Major bug reported by Jason Lowe and fixed by Jonathan Eagles (resourcemanager)
    RMContainerImpl does not handle event EXPIRE at state RUNNING
    -
    RMContainerImpl has a race condition where a container can enter the RUNNING state just as the container expires. This results in an invalid event transition error: - -{noformat} -2012-11-11 05:31:38,954 [ResourceManager Event Processor] ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state -org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: EXPIRE at RUNNING - at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301) - at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43) - at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443) - at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:205) - at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:44) - at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApp.containerCompleted(SchedulerApp.java:203) - at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1337) - at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:739) - at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:659) - at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:80) - at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:340) - at java.lang.Thread.run(Thread.java:619) -{noformat} - +
    RMContainerImpl has a race condition where a container can enter the RUNNING state just as the container expires. This results in an invalid event transition error: + +{noformat} +2012-11-11 05:31:38,954 [ResourceManager Event Processor] ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state +org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: EXPIRE at RUNNING + at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301) + at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43) + at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443) + at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:205) + at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:44) + at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApp.containerCompleted(SchedulerApp.java:203) + at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1337) + at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:739) + at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:659) + at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:80) + at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:340) + at java.lang.Thread.run(Thread.java:619) +{noformat} + EXPIRE needs to be handled (well at least ignored) in the RUNNING state to account for this race condition.
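A simplified, hypothetical sketch of "handle (or at least ignore) EXPIRE while RUNNING" (this is a switch-style toy, not the RMContainerImpl state machine): a late EXPIRE that races with completion is simply swallowed.
{code}
class ContainerTransitions {
  enum State { RUNNING, COMPLETED }
  enum Event { EXPIRE, FINISHED }

  static State handle(State current, Event event) {
    if (current == State.RUNNING && event == Event.EXPIRE) {
      return State.RUNNING;           // benign race: ignore the late EXPIRE
    }
    if (current == State.RUNNING && event == Event.FINISHED) {
      return State.COMPLETED;
    }
    return current;                   // other transitions elided
  }
}
{code}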
  • YARN-212. Blocker bug reported by Nathan Roberts and fixed by Nathan Roberts (nodemanager)
    NM state machine ignores an APPLICATION_CONTAINER_FINISHED event when it shouldn't
    -
    The NM state machines can make the following two invalid state transitions when a speculative attempt is killed shortly after it gets started. When this happens the NM keeps the log aggregation context open for this application and therefore chews up FDs and leases on the NN, eventually running the NN out of FDs and bringing down the entire cluster. - - -2012-11-07 05:36:33,774 [AsyncDispatcher event handler] WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Can't handle this event at current state -org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: APPLICATION_CONTAINER_FINISHED at INITING - -2012-11-07 05:36:33,775 [AsyncDispatcher event handler] WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Can't handle this event at current state: Current: [DONE], eventType: [INIT_CONTAINER] -org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: INIT_CONTAINER at DONE - - +
    The NM state machines can make the following two invalid state transitions when a speculative attempt is killed shortly after it gets started. When this happens the NM keeps the log aggregation context open for this application and therefore chews up FDs and leases on the NN, eventually running the NN out of FDs and bringing down the entire cluster. + + +2012-11-07 05:36:33,774 [AsyncDispatcher event handler] WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Can't handle this event at current state +org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: APPLICATION_CONTAINER_FINISHED at INITING + +2012-11-07 05:36:33,775 [AsyncDispatcher event handler] WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Can't handle this event at current state: Current: [DONE], eventType: [INIT_CONTAINER] +org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: INIT_CONTAINER at DONE + +
  • YARN-206. Major bug reported by Jason Lowe and fixed by Jason Lowe (resourcemanager)
    TestApplicationCleanup.testContainerCleanup occasionally fails
    -
    testContainerCleanup is occasionally failing with the error: - -testContainerCleanup(org.apache.hadoop.yarn.server.resourcemanager.TestApplicationCleanup): expected:<2> but was:<1> +
    testContainerCleanup is occasionally failing with the error: + +testContainerCleanup(org.apache.hadoop.yarn.server.resourcemanager.TestApplicationCleanup): expected:<2> but was:<1>
  • YARN-204. Major bug reported by Aleksey Gorshkov and fixed by Aleksey Gorshkov (applications)
    @@ -356,8 +356,8 @@ testContainerCleanup(org.apache.hadoop.yarn.server.resourcemanager.TestApplicati
  • YARN-202. Critical bug reported by Kihwal Lee and fixed by Kihwal Lee
    Log Aggregation generates a storm of fsync() for namenode
    -
    When the log aggregation is on, write to each aggregated container log causes hflush() to be called. For large clusters, this can creates a lot of fsync() calls for namenode. - +
When log aggregation is on, each write to an aggregated container log causes hflush() to be called. For large clusters, this can create a lot of fsync() calls on the namenode. + We have seen a 6-7x increase in the average number of fsync operations compared to 1.0.x on a large busy cluster. Over 99% of fsync ops were for log aggregation writing to tmp files.
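The issue does not spell out a fix, but the general mitigation direction can be sketched hypothetically: buffer aggregated-log writes and flush on a size threshold rather than syncing every write, so the namenode sees far fewer sync operations. The class, buffer size, and flush policy below are assumptions for illustration only.
{code}
import java.io.IOException;
import java.io.OutputStream;

class BatchedLogWriter {
  private final OutputStream out;                 // e.g. the aggregated log stream
  private final byte[] buf = new byte[64 * 1024];
  private int used = 0;

  BatchedLogWriter(OutputStream out) { this.out = out; }

  void write(byte[] data) throws IOException {
    if (data.length > buf.length) {               // oversized record: bypass the buffer
      flush();
      out.write(data);
      return;
    }
    if (used + data.length > buf.length) {
      flush();                                    // only now does data go downstream
    }
    System.arraycopy(data, 0, buf, used, data.length);
    used += data.length;
  }

  void flush() throws IOException {
    if (used > 0) {
      out.write(buf, 0, used);
      used = 0;
    }
    out.flush();                                  // the real code decides when to hflush()
  }
}
{code}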
  • YARN-201. Critical bug reported by Jason Lowe and fixed by Jason Lowe (capacityscheduler)
    @@ -366,87 +366,87 @@ We have seen 6-7x increase in the average number of fsync operations compared to
  • YARN-189. Blocker bug reported by Thomas Graves and fixed by Thomas Graves (resourcemanager)
    deadlock in RM - AMResponse object
    -
    we ran into a deadlock in the RM. - -============================= -"1128743461@qtp-1252749669-5201": - waiting for ownable synchronizer 0x00002aabbc87b960, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync), - which is held by "AsyncDispatcher event handler" -"AsyncDispatcher event handler": - waiting to lock monitor 0x00002ab0bba3a370 (object 0x00002aab3d4cd698, a org.apache.hadoop.yarn.api.records.impl.pb.AMResponsePBImpl), - which is held by "IPC Server handler 36 on 8030" -"IPC Server handler 36 on 8030": - waiting for ownable synchronizer 0x00002aabbc87b960, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync), - which is held by "AsyncDispatcher event handler" -Java stack information for the threads listed above: -=================================================== -"1128743461@qtp-1252749669-5201": - at sun.misc.Unsafe.park(Native Method) - - parking to wait for <0x00002aabbc87b960> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) - at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811) - at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:941) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1261) - at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:594) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.getFinalApplicationStatus(RMAppAttemptImpl.java:2 -95) - at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.getFinalApplicationStatus(RMAppImpl.java:222) - at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getApps(RMWebServices.java:328) - at sun.reflect.GeneratedMethodAccessor41.invoke(Unknown Source) - at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) - at java.lang.reflect.Method.invoke(Method.java:597) - at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) - at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) - at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaM -... -... -.. 
- - -"AsyncDispatcher event handler": - at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.unregisterAttempt(ApplicationMasterService.java:307) - - waiting to lock <0x00002aab3d4cd698> (a org.apache.hadoop.yarn.api.records.impl.pb.AMResponsePBImpl) - at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$BaseFinalTransition.transition(RMAppAttemptImpl.java:647) - at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$FinalTransition.transition(RMAppAttemptImpl.java:809) - at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$FinalTransition.transition(RMAppAttemptImpl.java:796) - at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357) - at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298) - at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43) - at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443) - - locked <0x00002aabbb673090> (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine) - at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:478) - at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:81) - at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:436) - at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:417) - at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126) - at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75) - at java.lang.Thread.run(Thread.java:619) -"IPC Server handler 36 on 8030": - at sun.misc.Unsafe.park(Native Method) - - parking to wait for <0x00002aabbc87b960> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) - at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) - at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811) - at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842) - at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178) - at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:807) - at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.pullJustFinishedContainers(RMAppAttemptImpl.java:437) - at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:285) - - locked <0x00002aab3d4cd698> (a org.apache.hadoop.yarn.api.records.impl.pb.AMResponsePBImpl) - at org.apache.hadoop.yarn.api.impl.pb.service.AMRMProtocolPBServiceImpl.allocate(AMRMProtocolPBServiceImpl.java:56) - at org.apache.hadoop.yarn.proto.AMRMProtocol$AMRMProtocolService$2.callBlockingMethod(AMRMProtocol.java:87) - at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Server.call(ProtoOverHadoopRpcEngine.java:353) - at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1528) - at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1524) - at java.security.AccessController.doPrivileged(Native Method) - at javax.security.auth.Subject.doAs(Subject.java:396) - 
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1212) - at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1522) +
    we ran into a deadlock in the RM. + +============================= +"1128743461@qtp-1252749669-5201": + waiting for ownable synchronizer 0x00002aabbc87b960, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync), + which is held by "AsyncDispatcher event handler" +"AsyncDispatcher event handler": + waiting to lock monitor 0x00002ab0bba3a370 (object 0x00002aab3d4cd698, a org.apache.hadoop.yarn.api.records.impl.pb.AMResponsePBImpl), + which is held by "IPC Server handler 36 on 8030" +"IPC Server handler 36 on 8030": + waiting for ownable synchronizer 0x00002aabbc87b960, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync), + which is held by "AsyncDispatcher event handler" +Java stack information for the threads listed above: +=================================================== +"1128743461@qtp-1252749669-5201": + at sun.misc.Unsafe.park(Native Method) + - parking to wait for <0x00002aabbc87b960> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) + at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811) + at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:941) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1261) + at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:594) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.getFinalApplicationStatus(RMAppAttemptImpl.java:2 +95) + at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.getFinalApplicationStatus(RMAppImpl.java:222) + at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getApps(RMWebServices.java:328) + at sun.reflect.GeneratedMethodAccessor41.invoke(Unknown Source) + at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) + at java.lang.reflect.Method.invoke(Method.java:597) + at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) + at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) + at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaM +... +... +.. 
+ + +"AsyncDispatcher event handler": + at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.unregisterAttempt(ApplicationMasterService.java:307) + - waiting to lock <0x00002aab3d4cd698> (a org.apache.hadoop.yarn.api.records.impl.pb.AMResponsePBImpl) + at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$BaseFinalTransition.transition(RMAppAttemptImpl.java:647) + at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$FinalTransition.transition(RMAppAttemptImpl.java:809) + at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$FinalTransition.transition(RMAppAttemptImpl.java:796) + at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357) + at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298) + at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43) + at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443) + - locked <0x00002aabbb673090> (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine) + at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:478) + at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:81) + at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:436) + at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:417) + at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126) + at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75) + at java.lang.Thread.run(Thread.java:619) +"IPC Server handler 36 on 8030": + at sun.misc.Unsafe.park(Native Method) + - parking to wait for <0x00002aabbc87b960> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) + at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) + at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811) + at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842) + at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178) + at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:807) + at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.pullJustFinishedContainers(RMAppAttemptImpl.java:437) + at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:285) + - locked <0x00002aab3d4cd698> (a org.apache.hadoop.yarn.api.records.impl.pb.AMResponsePBImpl) + at org.apache.hadoop.yarn.api.impl.pb.service.AMRMProtocolPBServiceImpl.allocate(AMRMProtocolPBServiceImpl.java:56) + at org.apache.hadoop.yarn.proto.AMRMProtocol$AMRMProtocolService$2.callBlockingMethod(AMRMProtocol.java:87) + at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Server.call(ProtoOverHadoopRpcEngine.java:353) + at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1528) + at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1524) + at java.security.AccessController.doPrivileged(Native Method) + at javax.security.auth.Subject.doAs(Subject.java:396) + 
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1212) + at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1522)
  • YARN-188. Major test reported by Aleksey Gorshkov and fixed by Aleksey Gorshkov (capacityscheduler)
    Coverage fixing for CapacityScheduler
    -
    some tests for CapacityScheduler -YARN-188-branch-0.23.patch patch for branch 0.23 -YARN-188-branch-2.patch patch for branch 2 -YARN-188-trunk.patch patch for trunk - +
    some tests for CapacityScheduler +YARN-188-branch-0.23.patch patch for branch 0.23 +YARN-188-branch-2.patch patch for branch 2 +YARN-188-trunk.patch patch for trunk +
  • YARN-187. Major new feature reported by Sandy Ryza and fixed by Sandy Ryza (scheduler)
    @@ -455,10 +455,10 @@ YARN-188-trunk.patch patch for trunk
  • YARN-186. Major test reported by Aleksey Gorshkov and fixed by Aleksey Gorshkov (resourcemanager , scheduler)
    Coverage fixing LinuxContainerExecutor
    -
    Added some tests for LinuxContainerExecuror -YARN-186-branch-0.23.patch patch for branch-0.23 -YARN-186-branch-2.patch patch for branch-2 -ARN-186-trunk.patch patch for trank +
Added some tests for LinuxContainerExecutor +YARN-186-branch-0.23.patch patch for branch-0.23 +YARN-186-branch-2.patch patch for branch-2 +ARN-186-trunk.patch patch for trunk
  • YARN-184. Major improvement reported by Sandy Ryza and fixed by Sandy Ryza
    @@ -475,49 +475,49 @@ ARN-186-trunk.patch patch for trank
  • YARN-180. Critical bug reported by Thomas Graves and fixed by Arun C Murthy (capacityscheduler)
Capacity scheduler - containers that get reserved create container token too early
    -
    The capacity scheduler has the ability to 'reserve' containers. Unfortunately before it decides that it goes to reserved rather then assigned, the Container object is created which creates a container token that expires in roughly 10 minutes by default. - +
The capacity scheduler has the ability to 'reserve' containers. Unfortunately, before it decides whether the container goes to reserved rather than assigned, the Container object is created, which creates a container token that expires in roughly 10 minutes by default. + This means that by the time the NM frees up enough space on that node for the container to move to assigned, the container token may have expired.
  • YARN-179. Blocker bug reported by Vinod Kumar Vavilapalli and fixed by Vinod Kumar Vavilapalli (capacityscheduler)
    Bunch of test failures on trunk
    -
    {{CapacityScheduler.setConf()}} mandates a YarnConfiguration. It doesn't need to, throughout all of YARN, components only depend on Configuration and depend on the callers to provide correct configuration. - +
{{CapacityScheduler.setConf()}} mandates a YarnConfiguration. It doesn't need to: throughout all of YARN, components only depend on Configuration and rely on the callers to provide the correct configuration. + This is causing multiple tests to fail.
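A hypothetical sketch of that direction (the holder class is illustrative, not the actual CapacityScheduler code): accept any Configuration and only wrap it if YARN-specific resources are needed, instead of rejecting callers that pass a plain Configuration.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

class SchedulerConfHolder {
  private Configuration conf;

  public void setConf(Configuration conf) {
    this.conf = (conf instanceof YarnConfiguration)
        ? conf
        : new YarnConfiguration(conf);   // wrap rather than throw
  }

  public Configuration getConf() {
    return conf;
  }
}
{code}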
  • YARN-178. Critical bug reported by Radim Kolar and fixed by Radim Kolar
    Fix custom ProcessTree instance creation
    -
    1. In current pluggable resourcecalculatorprocesstree is not passed root process id to custom implementation making it unusable. - -2. pstree do not extend Configured as it should - +
1. Currently the pluggable resourcecalculatorprocesstree does not pass the root process id to the custom implementation, making it unusable. + +2. pstree does not extend Configured as it should. + Added a constructor with a pid argument along with a test suite. Also added a test that pstree is correctly configured.
  • YARN-177. Critical bug reported by Thomas Graves and fixed by Arun C Murthy (capacityscheduler)
    CapacityScheduler - adding a queue while the RM is running has wacky results
    -
    Adding a queue to the capacity scheduler while the RM is running and then running a job in the queue added results in very strange behavior. The cluster Total Memory can either decrease or increase. We had a cluster where total memory decreased to almost 1/6th the capacity. Running on a small test cluster resulted in the capacity going up by simply adding a queue and running wordcount. - -Looking at the RM logs, used memory can go negative but other logs show the number positive: - - -2012-10-21 22:56:44,796 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.0375 absoluteUsedCapacity=0.0375 used=memory: 7680 cluster=memory: 204800 - -2012-10-21 22:56:45,831 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=-0.0225 absoluteUsedCapacity=-0.0225 used=memory: -4608 cluster=memory: 204800 - +
    Adding a queue to the capacity scheduler while the RM is running and then running a job in the queue added results in very strange behavior. The cluster Total Memory can either decrease or increase. We had a cluster where total memory decreased to almost 1/6th the capacity. Running on a small test cluster resulted in the capacity going up by simply adding a queue and running wordcount. + +Looking at the RM logs, used memory can go negative but other logs show the number positive: + + +2012-10-21 22:56:44,796 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.0375 absoluteUsedCapacity=0.0375 used=memory: 7680 cluster=memory: 204800 + +2012-10-21 22:56:45,831 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=-0.0225 absoluteUsedCapacity=-0.0225 used=memory: -4608 cluster=memory: 204800 +
  • YARN-170. Major bug reported by Sandy Ryza and fixed by Sandy Ryza (nodemanager)
    NodeManager stop() gets called twice on shutdown
    -
    The stop method in the NodeManager gets called twice when the NodeManager is shut down via the shutdown hook. - -The first is the stop that gets called directly by the shutdown hook. The second occurs when the NodeStatusUpdaterImpl is stopped. The NodeManager responds to the NodeStatusUpdaterImpl stop stateChanged event by stopping itself. This is so that NodeStatusUpdaterImpl can notify the NodeManager to stop, by stopping itself in response to a request from the ResourceManager - +
    The stop method in the NodeManager gets called twice when the NodeManager is shut down via the shutdown hook. + +The first is the stop that gets called directly by the shutdown hook. The second occurs when the NodeStatusUpdaterImpl is stopped. The NodeManager responds to the NodeStatusUpdaterImpl stop stateChanged event by stopping itself. This is so that NodeStatusUpdaterImpl can notify the NodeManager to stop, by stopping itself in response to a request from the ResourceManager + This could be avoided if the NodeStatusUpdaterImpl were to stop the NodeManager by calling its stop method directly.
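A hypothetical sketch of one way to make shutdown idempotent (not the actual NodeManager code), so the second stop() — the shutdown hook plus the NodeStatusUpdater-triggered stop — becomes a harmless no-op:
{code}
import java.util.concurrent.atomic.AtomicBoolean;

class StoppableService {
  private final AtomicBoolean stopped = new AtomicBoolean(false);

  public void stop() {
    if (!stopped.compareAndSet(false, true)) {
      return;                          // already stopped; ignore the duplicate call
    }
    // ... stop sub-services and release resources exactly once ...
  }
}
{code}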
  • YARN-169. Minor improvement reported by Anthony Rojas and fixed by Anthony Rojas (nodemanager)
    Update log4j.appender.EventCounter to use org.apache.hadoop.log.metrics.EventCounter
    -
    We should update the log4j.appender.EventCounter in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/resources/container-log4j.properties to use *org.apache.hadoop.log.metrics.EventCounter* rather than *org.apache.hadoop.metrics.jvm.EventCounter* to avoid triggering the following warning: - +
    We should update the log4j.appender.EventCounter in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/resources/container-log4j.properties to use *org.apache.hadoop.log.metrics.EventCounter* rather than *org.apache.hadoop.metrics.jvm.EventCounter* to avoid triggering the following warning: + {code}WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files{code}
  • YARN-166. Major bug reported by Thomas Graves and fixed by Thomas Graves (capacityscheduler)
    @@ -526,8 +526,8 @@ This could be avoided if the NodeStatusUpdaterImpl were to stop the NodeManager
  • YARN-165. Blocker improvement reported by Jason Lowe and fixed by Jason Lowe (resourcemanager)
    RM should point tracking URL to RM web page for app when AM fails
    -
    Currently when an ApplicationMaster fails the ResourceManager is updating the tracking URL to an empty string, see RMAppAttemptImpl.ContainerFinishedTransition. Unfortunately when the client attempts to follow the proxy URL it results in a web page showing an HTTP 500 error and an ugly backtrace because "http://" isn't a very helpful tracking URL. - +
    Currently, when an ApplicationMaster fails, the ResourceManager updates the tracking URL to an empty string (see RMAppAttemptImpl.ContainerFinishedTransition). Unfortunately, when the client attempts to follow the proxy URL, it gets a web page showing an HTTP 500 error and an ugly backtrace, because "http://" isn't a very helpful tracking URL. + It would be much more helpful if the proxy URL redirected to the RM webapp page for the specific application. That page shows the various AM attempts and pointers to their logs, which is useful for debugging the problems that caused the AM attempts to fail.
  • YARN-163. Major bug reported by Jason Lowe and fixed by Jason Lowe (nodemanager)
    @@ -540,8 +540,8 @@ It would be much more helpful if the proxy URL redirected to the RM webapp page
  • YARN-159. Major bug reported by Thomas Graves and fixed by Thomas Graves (resourcemanager)
    RM web ui applications page should be sorted to display last app first
    -
    RM web ui applications page should be sorted to display last app first. - +
    RM web ui applications page should be sorted to display the last app first. + It currently sorts with the smallest application id first, i.e. the apps that were submitted earliest. After you have a page worth of apps, it's much more useful for it to sort such that the biggest appid (the last submitted app) shows up first.
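    A minimal sketch of the descending sort, assuming a hypothetical row type with a numeric id (the real web UI works on application reports, not this class):
    {code}
    import java.util.Comparator;
    import java.util.List;

    // Hypothetical row type for illustration only.
    class AppRow {
      final long appId;
      AppRow(long appId) { this.appId = appId; }
    }

    class NewestFirstSort {
      // Largest (most recently submitted) application id first.
      static void sortNewestFirst(List<AppRow> apps) {
        apps.sort(Comparator.comparingLong((AppRow a) -> a.appId).reversed());
      }
    }
    {code}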
  • YARN-151. Major bug reported by Robert Joseph Evans and fixed by Ravi Prakash
    @@ -562,39 +562,39 @@ It currently sorts with smallest application id first, which is the first apps t
  • YARN-140. Major bug reported by Ahmed Radwan and fixed by Ahmed Radwan (capacityscheduler)
    Add capacity-scheduler-default.xml to provide a default set of configurations for the capacity scheduler.
    -
    When setting up the capacity scheduler users are faced with problems like: - -{code} -FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager -java.lang.IllegalArgumentException: Illegal capacity of -1 for queue root -{code} - -Which basically arises from missing basic configurations, which in many cases, there is no need to explicitly provide, and a default configuration will be sufficient. For example, to address the error above, the user need to add a capacity of 100 to the root queue. - -So, we need to add a capacity-scheduler-default.xml, this will be helpful to provide the basic set of default configurations required to run the capacity scheduler. The user can still override existing configurations or provide new ones in capacity-scheduler.xml. This is similar to *-default.xml vs *-site.xml for yarn, core, mapred, hdfs, etc. - +
    When setting up the capacity scheduler, users are faced with problems like: + +{code} +FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager +java.lang.IllegalArgumentException: Illegal capacity of -1 for queue root +{code} + +This basically arises from missing basic configurations which, in many cases, there is no need to provide explicitly; a default configuration is sufficient. For example, to address the error above, the user needs to add a capacity of 100 to the root queue. + +So we need to add a capacity-scheduler-default.xml to provide the basic set of default configurations required to run the capacity scheduler. The user can still override existing configurations or provide new ones in capacity-scheduler.xml. This is similar to *-default.xml vs *-site.xml for yarn, core, mapred, hdfs, etc. +
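    A minimal sketch of the default-vs-site layering, assuming the proposed file names and an example default (illustrative values, not the shipped defaults):
    {code}
    import org.apache.hadoop.conf.Configuration;

    public class CapacitySchedulerConfDemo {
      public static void main(String[] args) {
        Configuration conf = new Configuration(false);
        // Resources added later override earlier ones, so the site file can override defaults.
        conf.addResource("capacity-scheduler-default.xml"); // would carry root capacity = 100
        conf.addResource("capacity-scheduler.xml");         // user overrides, if any
        float rootCapacity = conf.getFloat("yarn.scheduler.capacity.root.capacity", -1f);
        System.out.println("root capacity = " + rootCapacity);
      }
    }
    {code}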
  • YARN-139. Major bug reported by Nathan Roberts and fixed by Vinod Kumar Vavilapalli (api)
    Interrupted Exception within AsyncDispatcher leads to user confusion
    -
    Successful applications tend to get InterruptedExceptions during shutdown. The exception is harmless but it leads to lots of user confusion and therefore could be cleaned up. - - -2012-09-28 14:50:12,477 WARN [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Interrupted Exception while stopping -java.lang.InterruptedException - at java.lang.Object.wait(Native Method) - at java.lang.Thread.join(Thread.java:1143) - at java.lang.Thread.join(Thread.java:1196) - at org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:105) - at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99) - at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89) - at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler.handle(MRAppMaster.java:437) - at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler.handle(MRAppMaster.java:402) - at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126) - at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75) - at java.lang.Thread.run(Thread.java:619) -2012-09-28 14:50:12,477 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.service.AbstractService: Service:Dispatcher is stopped. -2012-09-28 14:50:12,477 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.mapreduce.v2.app.MRAppMaster is stopped. +
    Successful applications tend to get InterruptedExceptions during shutdown. The exception is harmless but it leads to lots of user confusion and therefore could be cleaned up. + + +2012-09-28 14:50:12,477 WARN [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Interrupted Exception while stopping +java.lang.InterruptedException + at java.lang.Object.wait(Native Method) + at java.lang.Thread.join(Thread.java:1143) + at java.lang.Thread.join(Thread.java:1196) + at org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:105) + at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99) + at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89) + at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler.handle(MRAppMaster.java:437) + at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler.handle(MRAppMaster.java:402) + at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126) + at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75) + at java.lang.Thread.run(Thread.java:619) +2012-09-28 14:50:12,477 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.service.AbstractService: Service:Dispatcher is stopped. +2012-09-28 14:50:12,477 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.mapreduce.v2.app.MRAppMaster is stopped. 2012-09-28 14:50:12,477 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Exiting MR AppMaster..GoodBye
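    A minimal sketch of one common way such shutdown noise is quieted, assuming a hypothetical stop path (not the actual AsyncDispatcher change):
    {code}
    // Hypothetical illustration only.
    public class QuietStopExample {
      public void stop(Thread eventHandlingThread) {
        eventHandlingThread.interrupt();
        try {
          eventHandlingThread.join();
        } catch (InterruptedException ie) {
          // Expected during shutdown: restore the interrupt flag instead of emitting a
          // noisy WARN with a full stack trace that confuses users.
          Thread.currentThread().interrupt();
        }
      }
    }
    {code}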
  • YARN-136. Major bug reported by Vinod Kumar Vavilapalli and fixed by Vinod Kumar Vavilapalli (resourcemanager)
    @@ -603,8 +603,8 @@ java.lang.InterruptedException
  • YARN-135. Major sub-task reported by Vinod Kumar Vavilapalli and fixed by Vinod Kumar Vavilapalli (resourcemanager)
    ClientTokens should be per app-attempt and be unregistered on App-finish.
    -
    Two issues: - - ClientTokens are per app-attempt but are created per app. +
    Two issues: + - ClientTokens are per app-attempt but are created per app. - Apps don't get unregistered from RMClientTokenSecretManager.
  • YARN-134. Major sub-task reported by Vinod Kumar Vavilapalli and fixed by Vinod Kumar Vavilapalli
    @@ -613,30 +613,30 @@ java.lang.InterruptedException
  • YARN-133. Major bug reported by Thomas Graves and fixed by Ravi Prakash (resourcemanager)
    update web services docs for RM clusterMetrics
    -
    Looks like jira https://issues.apache.org/jira/browse/MAPREDUCE-3747 added in more RM cluster metrics but the docs didn't get updated: http://hadoop.apache.org/docs/r0.23.3/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Metrics_API - - +
    Looks like jira https://issues.apache.org/jira/browse/MAPREDUCE-3747 added more RM cluster metrics, but the docs didn't get updated: http://hadoop.apache.org/docs/r0.23.3/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Metrics_API + +
  • YARN-131. Major bug reported by Ahmed Radwan and fixed by Ahmed Radwan (capacityscheduler)
    Incorrect ACL properties in capacity scheduler documentation
    -
    The CapacityScheduler apt file incorrectly specifies the property names controlling acls for application submission and queue administration. - -{{yarn.scheduler.capacity.root.<queue-path>.acl_submit_jobs}} -should be -{{yarn.scheduler.capacity.root.<queue-path>.acl_submit_applications}} - -{{yarn.scheduler.capacity.root.<queue-path>.acl_administer_jobs}} -should be -{{yarn.scheduler.capacity.root.<queue-path>.acl_administer_queue}} - +
    The CapacityScheduler apt file incorrectly specifies the property names controlling ACLs for application submission and queue administration. + +{{yarn.scheduler.capacity.root.<queue-path>.acl_submit_jobs}} +should be +{{yarn.scheduler.capacity.root.<queue-path>.acl_submit_applications}} + +{{yarn.scheduler.capacity.root.<queue-path>.acl_administer_jobs}} +should be +{{yarn.scheduler.capacity.root.<queue-path>.acl_administer_queue}} + Uploading a patch momentarily.
  • YARN-129. Major improvement reported by Tom White and fixed by Tom White (client)
    Simplify classpath construction for mini YARN tests
    -
    The test classpath includes a special file called 'mrapp-generated-classpath' (or similar in distributed shell) that is constructed at build time, and whose contents are a classpath with all the dependencies needed to run the tests. When the classpath for a container (e.g. the AM) is constructed the contents of mrapp-generated-classpath is read and added to the classpath, and the file itself is then added to the classpath so that later when the AM constructs a classpath for a task container it can propagate the test classpath correctly. - -This mechanism can be drastically simplified by propagating the system classpath of the current JVM (read from the java.class.path property) to a launched JVM, but only if running in the context of the mini YARN cluster. Any tests that use the mini YARN cluster will automatically work with this change. Although any that explicitly deal with mrapp-generated-classpath can be simplified. +
    The test classpath includes a special file called 'mrapp-generated-classpath' (or similar in distributed shell) that is constructed at build time and whose contents are a classpath with all the dependencies needed to run the tests. When the classpath for a container (e.g. the AM) is constructed, the contents of mrapp-generated-classpath are read and added to the classpath, and the file itself is then added to the classpath so that later, when the AM constructs a classpath for a task container, it can propagate the test classpath correctly. + +This mechanism can be drastically simplified by propagating the system classpath of the current JVM (read from the java.class.path property) to a launched JVM, but only if running in the context of the mini YARN cluster. Any tests that use the mini YARN cluster will automatically work with this change, although any that explicitly deal with mrapp-generated-classpath can be simplified further.
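    A minimal sketch of the propagation idea, assuming a hypothetical helper and flag name (the real change lives in the YARN/MR test utilities):
    {code}
    import java.util.Map;

    public class MiniClusterClasspath {
      static final String MINI_CLUSTER_FLAG = "test.mini.yarn.cluster"; // assumed property name

      static void maybePropagateClasspath(Map<String, String> containerEnv) {
        // Only forward the launching JVM's classpath when running under the mini YARN cluster.
        if (Boolean.getBoolean(MINI_CLUSTER_FLAG)) {
          containerEnv.put("CLASSPATH", System.getProperty("java.class.path"));
        }
      }
    }
    {code}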
  • YARN-127. Major bug reported by Vinod Kumar Vavilapalli and fixed by Vinod Kumar Vavilapalli
    @@ -645,25 +645,25 @@ This mechanism can be drastically simplified by propagating the system classpath
  • YARN-116. Major bug reported by xieguiming and fixed by xieguiming (resourcemanager)
    RM is missing ability to add include/exclude files without a restart
    -
    The "yarn.resourcemanager.nodes.include-path" default value is "", if we need to add an include file, we must currently restart the RM. - -I suggest that for adding an include or exclude file, there should be no need to restart the RM. We may only execute the refresh command. The HDFS NameNode already has this ability. - -Fix is to the modify HostsFileReader class instances: - -From: -{code} -public HostsFileReader(String inFile, - String exFile) -{code} -To: -{code} - public HostsFileReader(Configuration conf, - String NODES_INCLUDE_FILE_PATH,String DEFAULT_NODES_INCLUDE_FILE_PATH, - String NODES_EXCLUDE_FILE_PATH,String DEFAULT_NODES_EXCLUDE_FILE_PATH) -{code} - -And thus, we can read the config file dynamically when a {{refreshNodes}} is invoked and therefore have no need to restart the ResourceManager. +
    The "yarn.resourcemanager.nodes.include-path" default value is "", if we need to add an include file, we must currently restart the RM. + +I suggest that for adding an include or exclude file, there should be no need to restart the RM. We may only execute the refresh command. The HDFS NameNode already has this ability. + +Fix is to the modify HostsFileReader class instances: + +From: +{code} +public HostsFileReader(String inFile, + String exFile) +{code} +To: +{code} + public HostsFileReader(Configuration conf, + String NODES_INCLUDE_FILE_PATH,String DEFAULT_NODES_INCLUDE_FILE_PATH, + String NODES_EXCLUDE_FILE_PATH,String DEFAULT_NODES_EXCLUDE_FILE_PATH) +{code} + +And thus, we can read the config file dynamically when a {{refreshNodes}} is invoked and therefore have no need to restart the ResourceManager.
  • YARN-103. Major improvement reported by Bikas Saha and fixed by Bikas Saha
    @@ -676,10 +676,10 @@ And thus, we can read the config file dynamically when a {{refreshNodes}} is inv
  • YARN-94. Major bug reported by Vinod Kumar Vavilapalli and fixed by Hitesh Shah (applications/distributed-shell)
    DistributedShell jar should point to Client as the main class by default
    -
    Today, it says so.. -{code} -$ $YARN_HOME/bin/yarn jar $YARN_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-$VERSION.jar -RunJar jarFile [mainClass] args... +
    Today, running the jar without specifying a main class only prints the RunJar usage: +{code} +$ $YARN_HOME/bin/yarn jar $YARN_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-$VERSION.jar +RunJar jarFile [mainClass] args... {code}
  • YARN-93. Major bug reported by Jason Lowe and fixed by Jason Lowe (resourcemanager)
    @@ -688,8 +688,8 @@ RunJar jarFile [mainClass] args...
  • YARN-82. Minor bug reported by Andy Isaacson and fixed by Hemanth Yamijala (nodemanager)
    YARN local-dirs defaults to /tmp/nm-local-dir
    -
    {{yarn.nodemanager.local-dirs}} defaults to {{/tmp/nm-local-dir}}. It should be {hadoop.tmp.dir}/nm-local-dir or similar. Among other problems, this can prevent multiple test clusters from starting on the same machine. - +
    {{yarn.nodemanager.local-dirs}} defaults to {{/tmp/nm-local-dir}}. It should be {{hadoop.tmp.dir}}/nm-local-dir or similar. Among other problems, this can prevent multiple test clusters from starting on the same machine. + Thanks to Hemanth Yamijala for reporting this issue.
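    A minimal sketch of how a hadoop.tmp.dir-relative default would behave, using Configuration's ${...} expansion (the paths below are example values, not real defaults):
    {code}
    import org.apache.hadoop.conf.Configuration;

    public class LocalDirsDefaultDemo {
      public static void main(String[] args) {
        Configuration conf = new Configuration(false);
        conf.set("hadoop.tmp.dir", "/data/1/tmp"); // example value
        conf.set("yarn.nodemanager.local-dirs", "${hadoop.tmp.dir}/nm-local-dir");
        // Configuration expands ${...} references on read: prints /data/1/tmp/nm-local-dir
        System.out.println(conf.get("yarn.nodemanager.local-dirs"));
      }
    }
    {code}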
  • YARN-78. Major bug reported by Bikas Saha and fixed by Bikas Saha (applications)
    @@ -698,8 +698,8 @@ Thanks to Hemanth Yamijala for reporting this issue.
  • YARN-72. Major bug reported by Hitesh Shah and fixed by Sandy Ryza (nodemanager)
    NM should handle cleaning up containers when it shuts down
    -
    Ideally, the NM should wait for a limited amount of time when it gets a shutdown signal for existing containers to complete and kill the containers ( if we pick an aggressive approach ) after this time interval. - +
    Ideally, when the NM gets a shutdown signal, it should wait for a limited amount of time for existing containers to complete and then kill the containers (if we pick an aggressive approach) after this time interval. + For NMs which come up after an unclean shutdown, the NM should look through its directories for existing container pid files and try to kill any existing containers matching the pids found.
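    A minimal sketch of the bounded-grace-period idea, assuming a hypothetical tracker interface and grace period (not the NM's real container API):
    {code}
    import java.util.concurrent.TimeUnit;

    public class GracefulContainerShutdown {
      private static final long GRACE_PERIOD_MS = TimeUnit.SECONDS.toMillis(30); // assumed value

      interface ContainerTracker {
        boolean hasLiveContainers();
        void killRemainingContainers();
      }

      void shutdown(ContainerTracker tracker) throws InterruptedException {
        long deadline = System.currentTimeMillis() + GRACE_PERIOD_MS;
        // Give running containers a bounded chance to finish...
        while (tracker.hasLiveContainers() && System.currentTimeMillis() < deadline) {
          Thread.sleep(100);
        }
        // ...then take the aggressive path and kill whatever is left.
        tracker.killRemainingContainers();
      }
    }
    {code}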
  • YARN-57. Major improvement reported by Radim Kolar and fixed by Radim Kolar (nodemanager)
    @@ -712,29 +712,29 @@ For NMs which come up after an unclean shutdown, the NM should look through its
  • YARN-43. Major bug reported by Thomas Graves and fixed by Thomas Graves
    TestResourceTrackerService fail intermittently on jdk7
    -
    Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.73 sec <<< FAILURE! -testDecommissionWithIncludeHosts(org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService) Time elapsed: 0.086 sec <<< FAILURE! -junit.framework.AssertionFailedError: expected:<0> but was:<1> at junit.framework.Assert.fail(Assert.java:47) - at junit.framework.Assert.failNotEquals(Assert.java:283) - at junit.framework.Assert.assertEquals(Assert.java:64) - at junit.framework.Assert.assertEquals(Assert.java:195) - at junit.framework.Assert.assertEquals(Assert.java:201) +
    Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.73 sec <<< FAILURE! +testDecommissionWithIncludeHosts(org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService) Time elapsed: 0.086 sec <<< FAILURE! +junit.framework.AssertionFailedError: expected:<0> but was:<1> at junit.framework.Assert.fail(Assert.java:47) + at junit.framework.Assert.failNotEquals(Assert.java:283) + at junit.framework.Assert.assertEquals(Assert.java:64) + at junit.framework.Assert.assertEquals(Assert.java:195) + at junit.framework.Assert.assertEquals(Assert.java:201) at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testDecommissionWithIncludeHosts(TestResourceTrackerService.java:90)
  • YARN-40. Major bug reported by Devaraj K and fixed by Devaraj K (client)
    Provide support for missing yarn commands
    -
    1. status <app-id> -2. kill <app-id> (Already issue present with Id : MAPREDUCE-3793) -3. list-apps [all] +
    1. status <app-id> +2. kill <app-id> (Already issue present with Id : MAPREDUCE-3793) +3. list-apps [all] 4. nodes-report
  • YARN-33. Major bug reported by Mayank Bansal and fixed by Mayank Bansal (nodemanager)
    LocalDirsHandler should validate the configured local and log dirs
    -
    WHen yarn.nodemanager.log-dirs is with file:// URI then startup of node manager creates the directory like file:// under CWD. - -WHich should not be there. - -Thanks, +
    When yarn.nodemanager.log-dirs is configured with a file:// URI, node manager startup creates a directory named like file:// under the CWD. + +That directory should not be there. + +Thanks, Mayank
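    A minimal sketch of the kind of validation implied, assuming a hypothetical helper (the real fix is in LocalDirsHandler):
    {code}
    import java.net.URI;

    public class LocalDirValidator {
      static String toLocalPath(String dir) {
        URI uri = URI.create(dir);
        String scheme = uri.getScheme();
        if (scheme == null) {
          return dir;                    // already a plain local path
        }
        if ("file".equalsIgnoreCase(scheme)) {
          return uri.getPath();          // strip the file:// prefix instead of mkdir'ing it
        }
        throw new IllegalArgumentException("local/log dirs must be local paths: " + dir);
      }
    }
    {code}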
  • YARN-32. Major bug reported by Thomas Graves and fixed by Vinod Kumar Vavilapalli
    @@ -743,23 +743,23 @@ Mayank
  • YARN-30. Major bug reported by Thomas Graves and fixed by Thomas Graves
    TestNMWebServicesApps, TestRMWebServicesApps and TestRMWebServicesNodes fail on jdk7
    -
    It looks like the string changed from "const class" to "constant". - - -Tests run: 19, Failures: 3, Errors: 0, Skipped: 0, Time elapsed: 6.786 sec <<< FAILURE! -testNodeAppsStateInvalid(org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServicesApps) Time elapsed: 0.248 sec <<< FAILURE! +
    It looks like the string changed from "const class" to "constant". + + +Tests run: 19, Failures: 3, Errors: 0, Skipped: 0, Time elapsed: 6.786 sec <<< FAILURE! +testNodeAppsStateInvalid(org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServicesApps) Time elapsed: 0.248 sec <<< FAILURE! java.lang.AssertionError: exception message doesn't match, got: No enum constant org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationState.FOO_STATE expected: No enum const class org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationState.FOO_STATE
  • YARN-28. Major bug reported by Thomas Graves and fixed by Thomas Graves
    TestCompositeService fails on jdk7
    -
    test TestCompositeService fails when run with jdk7. - +
    The test TestCompositeService fails when run with jdk7. + It appears to expect the testCallSequence test to be called first and the sequence numbers to start at 0. On jdk7 it is not called first and the sequence number has already been incremented.
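    A minimal sketch of one way to avoid depending on test method order, assuming a hypothetical shared counter (not the actual TestCompositeService fix):
    {code}
    import org.junit.Before;

    public class SequenceResetExample {
      static int sequenceNumber; // shared state the ordering-sensitive assertions relied on

      @Before
      public void resetSequence() {
        // Reset before every test so results do not depend on which test runs first.
        sequenceNumber = 0;
      }
    }
    {code}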
  • YARN-23. Major improvement reported by Karthik Kambatla and fixed by Karthik Kambatla (scheduler)
    FairScheduler: FSQueueSchedulable#updateDemand() - potential redundant aggregation
    -
    In FS, FSQueueSchedulable#updateDemand() limits the demand to maxTasks only after iterating though all the pools and computing the final demand. - +
    In FS, FSQueueSchedulable#updateDemand() limits the demand to maxTasks only after iterating through all the pools and computing the final demand. + By checking whether the demand has reached maxTasks in every iteration, we can avoid redundant work, at the expense of one condition check per iteration.
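    A minimal sketch of the proposed early exit, with a simplified signature (FS actually aggregates Schedulable demands, not plain integers):
    {code}
    import java.util.List;

    public class DemandAggregation {
      static int updateDemand(List<Integer> poolDemands, int maxTasks) {
        int demand = 0;
        for (int poolDemand : poolDemands) {
          demand += poolDemand;
          if (demand >= maxTasks) {
            return maxTasks; // cap reached: skip the remaining pools
          }
        }
        return demand;
      }
    }
    {code}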
  • YARN-3. Major sub-task reported by Arun C Murthy and fixed by Andrew Ferguson
    @@ -808,7 +808,7 @@ By checking if the demand has reached maxTasks in every iteration, we can avoid
  • MAPREDUCE-4928. Major improvement reported by Suresh Srinivas and fixed by Suresh Srinivas (applicationmaster , security)
    Use token request messages defined in hadoop common
    -
    Protobuf message GetDelegationTokenRequestProto field renewer is made requried from optional. This change is not wire compatible with the older releases. +
    Protobuf message GetDelegationTokenRequestProto field renewer is changed from optional to required. This change is not wire-compatible with older releases.
  • MAPREDUCE-4925. Major bug reported by Karthik Kambatla and fixed by Karthik Kambatla (examples)
    @@ -1269,7 +1269,7 @@ By checking if the demand has reached maxTasks in every iteration, we can avoid
  • HDFS-4369. Blocker bug reported by Suresh Srinivas and fixed by Suresh Srinivas (namenode)
    GetBlockKeysResponseProto does not handle null response
    -
    Protobuf message GetBlockKeysResponseProto member keys is made optional from required so that null values can be passed over the wire. This is an incompatible wire protocol change and does not affect the API backward compatibility. +
    Protobuf message GetBlockKeysResponseProto member keys is changed from required to optional so that null values can be passed over the wire. This is an incompatible wire protocol change, but it does not affect API backward compatibility.
  • HDFS-4367. Blocker bug reported by Suresh Srinivas and fixed by Suresh Srinivas (namenode)
    @@ -1278,7 +1278,7 @@ By checking if the demand has reached maxTasks in every iteration, we can avoid
  • HDFS-4364. Blocker bug reported by Suresh Srinivas and fixed by Suresh Srinivas
    GetLinkTargetResponseProto does not handle null path
    -
    Protobuf message GetLinkTargetResponseProto member targetPath is made optional from required so that null values can be passed over the wire. This is an incompatible wire protocol change and does not affect the API backward compatibility. +
    Protobuf message GetLinkTargetResponseProto member targetPath is changed from required to optional so that null values can be passed over the wire. This is an incompatible wire protocol change, but it does not affect API backward compatibility.
  • HDFS-4363. Major bug reported by Suresh Srinivas and fixed by Suresh Srinivas
    @@ -1299,11 +1299,11 @@ By checking if the demand has reached maxTasks in every iteration, we can avoid
  • HDFS-4350. Major bug reported by Andrew Wang and fixed by Andrew Wang
    Make enabling of stale marking on read and write paths independent
    -
    This patch makes an incompatible configuration change, as described below: -In releases 1.1.0 and other point releases 1.1.x, the configuration parameter "dfs.namenode.check.stale.datanode" could be used to turn on checking for the stale nodes. This configuration is no longer supported in release 1.2.0 onwards and is renamed as "dfs.namenode.avoid.read.stale.datanode". - -How feature works and configuring this feature: -As described in HDFS-3703 release notes, datanode stale period can be configured using parameter "dfs.namenode.stale.datanode.interval" in seconds (default value is 30 seconds). NameNode can be configured to use this staleness information for reads using configuration "dfs.namenode.avoid.read.stale.datanode". When this parameter is set to true, namenode picks a stale datanode as the last target to read from when returning block locations for reads. Using staleness information for writes is as described in the releases notes of HDFS-3912. +
    This patch makes an incompatible configuration change, as described below: +In release 1.1.0 and other 1.1.x point releases, the configuration parameter "dfs.namenode.check.stale.datanode" could be used to turn on checking for stale nodes. This configuration is no longer supported from release 1.2.0 onwards and has been renamed to "dfs.namenode.avoid.read.stale.datanode". + +How the feature works and how to configure it: +As described in the HDFS-3703 release notes, the datanode stale period can be configured using the parameter "dfs.namenode.stale.datanode.interval" in seconds (default value is 30 seconds). The NameNode can be configured to use this staleness information for reads via the configuration "dfs.namenode.avoid.read.stale.datanode". When this parameter is set to true, the namenode picks a stale datanode as the last target to read from when returning block locations for reads. Using staleness information for writes is described in the release notes of HDFS-3912.
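    A minimal sketch of toggling the renamed read-path key programmatically, as one illustration of the configuration change described above:
    {code}
    import org.apache.hadoop.conf.Configuration;

    public class StaleReadConfigDemo {
      public static void main(String[] args) {
        Configuration conf = new Configuration(false);
        // 1.2.0+ key; the old dfs.namenode.check.stale.datanode key is no longer honored.
        conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
        System.out.println(conf.getBoolean("dfs.namenode.avoid.read.stale.datanode", false));
      }
    }
    {code}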
  • HDFS-4349. Major test reported by Konstantin Shvachko and fixed by Konstantin Shvachko (namenode , test)
    @@ -1576,10 +1576,10 @@ As described in HDFS-3703 release notes, datanode stale period can be configured
  • HDFS-4059. Minor sub-task reported by Jing Zhao and fixed by Jing Zhao (datanode , namenode)
    Add number of stale DataNodes to metrics
    -
    This jira adds a new metric with name "StaleDataNodes" under metrics context "dfs" of type Gauge. This tracks the number of DataNodes marked as stale. A DataNode is marked stale when the heartbeat message from the DataNode is not received within the configured time ""dfs.namenode.stale.datanode.interval". - - -Please see hdfs-default.xml documentation corresponding to ""dfs.namenode.stale.datanode.interval" for more details on how to configure this feature. When this feature is not configured, this metrics would return zero. +
    This jira adds a new metric named "StaleDataNodes" under metrics context "dfs" of type Gauge. This tracks the number of DataNodes marked as stale. A DataNode is marked stale when the heartbeat message from the DataNode is not received within the time configured via "dfs.namenode.stale.datanode.interval". + + +Please see the hdfs-default.xml documentation for "dfs.namenode.stale.datanode.interval" for more details on how to configure this feature. When this feature is not configured, this metric returns zero.
  • HDFS-4058. Major improvement reported by Eli Collins and fixed by Eli Collins (datanode)
    @@ -1840,9 +1840,9 @@ Please see hdfs-default.xml documentation corresponding to ""dfs.namenode.stale.
  • HDFS-3703. Major improvement reported by nkeywal and fixed by Jing Zhao (datanode , namenode)
    Decrease the datanode failure detection time
    -
    This jira adds a new DataNode state called "stale" at the NameNode. DataNodes are marked as stale if it does not send heartbeat message to NameNode within the timeout configured using the configuration parameter "dfs.namenode.stale.datanode.interval" in seconds (default value is 30 seconds). NameNode picks a stale datanode as the last target to read from when returning block locations for reads. - -This feature is by default turned * off *. To turn on the feature, set the HDFS configuration "dfs.namenode.check.stale.datanode" to true. +
    This jira adds a new DataNode state called "stale" at the NameNode. A DataNode is marked as stale if it does not send a heartbeat message to the NameNode within the timeout configured using the configuration parameter "dfs.namenode.stale.datanode.interval" in seconds (default value is 30 seconds). The NameNode picks a stale datanode as the last target to read from when returning block locations for reads. + +This feature is turned *off* by default. To turn on the feature, set the HDFS configuration "dfs.namenode.check.stale.datanode" to true.
  • HDFS-3695. Major sub-task reported by Todd Lipcon and fixed by Todd Lipcon (ha , namenode)