Hadoop 2.1.0-beta Release Notes
These release notes include new developer and user-facing incompatibilities, features, and major improvements.
Changes since Hadoop 2.0.5-alpha
- YARN-968.
Blocker bug reported by Kihwal Lee and fixed by Vinod Kumar Vavilapalli
RM admin commands don't work
If an RM admin command is issued using the CLI, I get something like the following:
13/07/24 17:19:40 INFO client.RMProxy: Connecting to ResourceManager at xxxx.com/1.2.3.4:1234
refreshQueues: Unknown protocol: org.apache.hadoop.yarn.api.ResourceManagerAdministrationProtocolPB
- YARN-961.
Blocker bug reported by Omkar Vinit Joshi and fixed by Omkar Vinit Joshi
ContainerManagerImpl should enforce token on server. Today it is [TOKEN, SIMPLE]
We should only accept SecurityAuthMethod.TOKEN for ContainerManagementProtocol. Today it also accepts SIMPLE for unsecured environment.
- YARN-960.
Blocker bug reported by Alejandro Abdelnur and fixed by Daryn Sharp
TestMRCredentials and TestBinaryTokenFile are failing on trunk
Not sure, but this may be a fallout from YARN-701 and/or related to YARN-945.
Making it a blocker until full impact of the issue is scoped.
- YARN-945.
Blocker bug reported by Bikas Saha and fixed by Vinod Kumar Vavilapalli
AM register failing after AMRMToken
2013-07-19 15:53:55,569 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 54313: readAndProcess from client 127.0.0.1 threw exception [org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN]]
org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN]
at org.apache.hadoop.ipc.Server$Connection.initializeAuthContext(Server.java:1531)
at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1482)
at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:788)
at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:587)
at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:562)
- YARN-937.
Blocker bug reported by Arun C Murthy and fixed by Alejandro Abdelnur
Fix unmanaged AM in non-secure/secure setup post YARN-701
Fix unmanaged AM in non-secure/secure setup post YARN-701 since app-tokens will be used in both scenarios.
- YARN-932.
Major bug reported by Sandy Ryza and fixed by Karthik Kambatla
TestResourceLocalizationService.testLocalizationInit can fail on JDK7
It looks like this is occurring when testLocalizationInit doesn't run first. Somehow yarn.nodemanager.log-dirs is getting set by one of the other tests (to ${yarn.log.dir}/userlogs), but yarn.log.dir isn't being set.
- YARN-927.
Major task reported by Bikas Saha and fixed by Bikas Saha
Change ContainerRequest to not have more than 1 container count and remove StoredContainerRequest
The downside is having to use more than 1 container request when requesting more than 1 container at * priority. For most other use cases that have specific locations, we need to make multiple container requests anyway. This will also remove unnecessary duplication caused by StoredContainerRequest. It will make getMatchingRequest() always available and removeContainerRequest() easy to use.
- YARN-926.
Blocker bug reported by Vinod Kumar Vavilapalli and fixed by Jian He
ContainerManagementProtocol APIs should take in requests for multiple containers
AMs typically have to launch multiple containers on a node and the current single container APIs aren't helping. We should have all the APIs take in multiple requests and return multiple responses.
The client libraries could expose both the single and multi-container requests.
- YARN-922.
Major sub-task reported by Jian He and fixed by Jian He (resourcemanager)
Change FileSystemRMStateStore to use directories
Store each app and its attempts in the same directory so that removing application state is only one operation
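The layout idea can be sketched on a local filesystem with java.nio.file. This is only an illustration under stated assumptions: the class name, the file names, and the use of the local filesystem are hypothetical and are not taken from FileSystemRMStateStore.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration: one directory per application, holding the
// application's own state file plus one file per attempt.
final class AppStateDirSketch {

    // Writes a state file to <root>/<appId>/<fileName>, creating the
    // application directory if it does not exist yet.
    static Path store(Path root, String appId, String fileName, byte[] state)
            throws IOException {
        Path appDir = Files.createDirectories(root.resolve(appId));
        return Files.write(appDir.resolve(fileName), state);
    }

    // Removes everything stored for one application by deleting its directory.
    // A filesystem with a recursive delete call can do this in one operation;
    // java.nio.file has none, so this sketch walks the tree deepest-first.
    static void removeApplication(Path root, String appId) throws IOException {
        Path appDir = root.resolve(appId);
        if (!Files.exists(appDir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(appDir)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path path : paths) {
            Files.delete(path);
        }
    }
}
```

Because the attempts live under the application's directory, the caller does not need to know how many attempts exist in order to clean up.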
- YARN-919.
Minor bug reported by Mayank Bansal and fixed by Mayank Bansal
Document setting default heap sizes in yarn env
Right now there are no defaults in the yarn env scripts for the resource manager and node manager, and if a user wants to override them, the user has to go to the documentation, find the variables, and change the script.
There is no straightforward way to change it in the script. Just updating the variables with defaults.
- YARN-918.
Blocker bug reported by Vinod Kumar Vavilapalli and fixed by Vinod Kumar Vavilapalli
ApplicationMasterProtocol doesn't need ApplicationAttemptId in the payload after YARN-701
Once we use AMRMToken irrespective of Kerberos after YARN-701, we don't need ApplicationAttemptId in the RPC payload. This is an API change, so doing it as a blocker for 2.1.0-beta.
- YARN-912.
Major bug reported by Bikas Saha and fixed by Mayank Bansal
Create exceptions package in common/api for yarn and move client facing exceptions to them
Exceptions like InvalidResourceBlacklistRequestException, InvalidResourceRequestException, InvalidApplicationMasterRequestException etc are currently inside ResourceManager and not visible to clients.
- YARN-909.
Minor bug reported by Chuan Liu and fixed by Chuan Liu (nodemanager)
Disable TestLinuxContainerExecutorWithMocks on Windows
This unit test tests a Linux specific feature. We should skip this unit test on Windows. A similar unit test 'TestLinuxContainerExecutor' was already skipped when running on Windows.
- YARN-897.
Blocker bug reported by Djellel Eddine Difallah and fixed by Djellel Eddine Difallah (capacityscheduler)
CapacityScheduler wrongly sorted queues
The childQueues of a ParentQueue are stored in a TreeSet where UsedCapacity defines the sort order. This ensures that the queue with the least UsedCapacity receives resources next. On container assignment we correctly update the order, but we fail to do so on container completion. This corrupts the TreeSet structure, and under-capacity queues might starve for resources.
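The general hazard is that a TreeSet does not notice when an element's sort key changes in place. A minimal sketch of the safe update pattern follows; the Queue class, the comparator, and the helper name are hypothetical stand-ins, not the CapacityScheduler classes or the actual patch.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical stand-ins for the scheduler's queues.
final class QueueOrderingSketch {

    static final class Queue {
        final String name;
        double usedCapacity;

        Queue(String name, double usedCapacity) {
            this.name = name;
            this.usedCapacity = usedCapacity;
        }
    }

    // Least-used queue first; the name breaks ties so that two distinct
    // queues never compare as equal.
    static final Comparator<Queue> BY_USED_CAPACITY = (a, b) -> {
        int byUsage = Double.compare(a.usedCapacity, b.usedCapacity);
        return byUsage != 0 ? byUsage : a.name.compareTo(b.name);
    };

    // Changes the sort key safely: remove the element while the old key can
    // still locate it, mutate it, then re-insert it so the tree places it
    // according to the new key.
    static void setUsedCapacity(TreeSet<Queue> queues, Queue queue, double newUsedCapacity) {
        boolean wasPresent = queues.remove(queue);
        queue.usedCapacity = newUsedCapacity;
        if (wasPresent) {
            queues.add(queue);
        }
    }
}
```

The same remove, mutate, re-insert sequence has to run on every path that changes the key, which in the scenario described here includes container completion as well as assignment.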
- YARN-894.
Minor bug reported by Chuan Liu and fixed by Chuan Liu (nodemanager)
NodeHealthScriptRunner timeout checking is inaccurate on Windows
In {{NodeHealthScriptRunner}}, we set the HealthChecker status based on the Shell execution results. Some statuses are based on the exception thrown during the Shell script execution.
Currently, we catch a non-ExitCodeException from ShellCommandExecutor, and if Shell has the timeout status set at the same time, we also set the HealthChecker status to timeout.
We have the following execution sequence in Shell:
1) In main thread, schedule a delayed timer task that will kill the original process upon timeout.
2) In main thread, open a buffered reader and feed in the process's standard input stream.
3) When timeout happens, the timer task will call {{Process#destroy()}} to kill the main process.
On Linux, when the timeout happens and the process is killed, the buffered reader will throw an IOException with the message "Stream closed" in the main thread.
On Windows, we don't get the IOException. Only "-1" is returned from the reader, which indicates the buffer is finished. As a result, the timeout status is not set on Windows, and {{TestNodeHealthService}} fails on Windows because of this.
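One way to make the outcome independent of how the read loop ends is to consult the timeout flag after the loop on both paths. The sketch below illustrates that idea only; the class, the enum, and the method are hypothetical and are not the code in Shell or NodeHealthScriptRunner, and the flag is assumed to be set by the timer task before it destroys the process.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; none of these names come from Hadoop.
final class TimeoutStatusSketch {

    enum Status { SUCCESS, TIMED_OUT, FAILED }

    // Drains the process output, then derives the status from the timeout
    // flag, whether the read loop ended with an exception (as described for
    // Linux) or with end-of-stream (as described for Windows).
    static Status drainAndClassify(BufferedReader processOutput, AtomicBoolean timedOut) {
        boolean readFailed = false;
        try {
            while (processOutput.readLine() != null) {
                // Output is discarded in this sketch.
            }
        } catch (IOException e) {
            // For example "Stream closed" after the process was destroyed.
            readFailed = true;
        }
        if (timedOut.get()) {
            return Status.TIMED_OUT;
        }
        return readFailed ? Status.FAILED : Status.SUCCESS;
    }
}
```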
- YARN-883.
Major improvement reported by Sandy Ryza and fixed by Sandy Ryza (scheduler)
Expose Fair Scheduler-specific queue metrics
When the Fair Scheduler is enabled, QueueMetrics should include fair share, minimum share, and maximum share.
- YARN-877.
Major sub-task reported by Junping Du and fixed by Junping Du (scheduler)
Allow for black-listing resources in FifoScheduler
YARN-750 already addressed the blacklist work in the YARN API and the CS scheduler; this JIRA adds the implementation for FifoScheduler.
- YARN-875.
Major bug reported by Bikas Saha and fixed by Xuan Gong
Application can hang if AMRMClientAsync callback thread has exception
Currently that thread will die and never call back, so the app can hang. A possible solution could be to catch Throwable in the callback thread and then call client.onError().
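The proposed guard can be sketched as follows. The Handler interface and the dispatch method are hypothetical and only mirror the shape of the idea; they are not the AMRMClientAsync API, and stopping after the first error is a choice made for this sketch.

```java
import java.util.List;

// Hypothetical callback interface and dispatcher.
final class CallbackGuardSketch {

    interface Handler {
        void onEvent(String event);
        void onError(Throwable error);
    }

    // Delivers events one by one. Any Throwable raised by the handler is
    // reported through onError instead of silently ending the callback
    // thread. Returns the number of events delivered without error.
    static int dispatch(List<String> events, Handler handler) {
        int delivered = 0;
        for (String event : events) {
            try {
                handler.onEvent(event);
                delivered++;
            } catch (Throwable t) {
                handler.onError(t);
                break; // Stop delivering after the first failure in this sketch.
            }
        }
        return delivered;
    }
}
```

With this shape the application always hears about a failed callback and can decide whether to shut down or retry, rather than waiting forever on a thread that no longer exists.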
- YARN-874.
Blocker bug reported by Vinod Kumar Vavilapalli and fixed by Vinod Kumar Vavilapalli
Tracking YARN/MR test failures after HADOOP-9421 and YARN-827
HADOOP-9421 and YARN-827 broke some YARN/MR tests. Tracking those.
- YARN-873.
Major sub-task reported by Bikas Saha and fixed by Xuan Gong
YARNClient.getApplicationReport(unknownAppId) returns a null report
How can the client find out that app does not exist?
- YARN-869.
Blocker bug reported by Vinod Kumar Vavilapalli and fixed by Vinod Kumar Vavilapalli
ResourceManagerAdministrationProtocol should neither be public (yet) nor in yarn.api
This is an admin-only API, and we don't yet know if people can or should write new tools against it. I am going to move it to yarn.server.api and make it @Private.
- YARN-866.
Major test reported by Wei Yan and fixed by Wei Yan
Add test for class ResourceWeights
Add test case for the class ResourceWeights
- YARN-865.
Major improvement reported by Xuan Gong and fixed by Xuan Gong
RM webservices can't query based on application Types
The resource manager web service api to get the list of apps doesn't have a query parameter for appTypes.
- YARN-861.
Critical bug reported by Devaraj K and fixed by Vinod Kumar Vavilapalli (nodemanager)
TestContainerManager is failing
https://builds.apache.org/job/Hadoop-Yarn-trunk/246/
{code:xml}
Running org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManager
Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 19.249 sec <<< FAILURE!
testContainerManagerInitialization(org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManager) Time elapsed: 286 sec <<< FAILURE!
junit.framework.ComparisonFailure: expected:<[asf009.sp2.ygridcore.ne]t> but was:<[localhos]t>
at junit.framework.Assert.assertEquals(Assert.java:85)
{code}
- YARN-854.
Blocker bug reported by Ramya Sunil and fixed by Omkar Vinit Joshi
App submission fails on secure deploy
App submission on secure cluster fails with the following exception:
{noformat}
INFO mapreduce.Job: Job jobID failed with state FAILED due to: Application applicationID failed 2 times due to AM Container for appattemptID exited with exitCode: -1000 due to: App initialization failed (255) with output: main : command provided 0
main : user is qa_user
javax.security.sasl.SaslException: DIGEST-MD5: digest response format violation. Mismatched response. [Caused by org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:65)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:235)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:348)
Caused by: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.
at org.apache.hadoop.ipc.Client.call(Client.java:1298)
at org.apache.hadoop.ipc.Client.call(Client.java:1250)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:204)
at $Proxy7.heartbeat(Unknown Source)
at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:62)
... 3 more
.Failing this attempt.. Failing the application.
{noformat}
- YARN-853.
Major bug reported by Devaraj K and fixed by Devaraj K (capacityscheduler)
maximum-am-resource-percent doesn't work after refreshQueues command
If we update yarn.scheduler.capacity.maximum-am-resource-percent / yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent configuration and then do the refreshNodes, it uses the new config value to calculate Max Active Applications and Max Active Application Per User. If we add new node after issuing 'rmadmin -refreshQueues' command, it uses the old maximum-am-resource-percent config value to calculate Max Active Applications and Max Active Application Per User.
- YARN-852.
Minor bug reported by Chuan Liu and fixed by Chuan Liu
TestAggregatedLogFormat.testContainerLogsFileAccess fails on Windows
The YARN unit test case fails on Windows when comparing the expected message with the log message in the file. The expected message constructed in the test case has two problems: 1) It uses Path.separator to concatenate the path string. Path.separator is always a forward slash, which does not match the backslash used in the log message. 2) On Windows, the default file owner is the Administrators group if the file is created by an Administrators user. The test expects the user to be the current user.
- YARN-851.
Major bug reported by Omkar Vinit Joshi and fixed by Omkar Vinit Joshi
Share NMTokens using NMTokenCache (api-based) instead of memory based approach which is used currently.
It is a follow up ticket for YARN-694. Changing the way NMTokens are shared.
- YARN-850.
Major sub-task reported by Jian He and fixed by Jian He
Rename getClusterAvailableResources to getAvailableResources in AMRMClients
- YARN-848.
Major bug reported by Hitesh Shah and fixed by Hitesh Shah
Nodemanager does not register with RM using the fully qualified hostname
If the hostname is misconfigured to not be fully qualified ( i.e. hostname returns foo and hostname -f returns foo.bar.xyz ), the NM ends up registering with the RM using only "foo". This can create problems if DNS cannot resolve the hostname properly.
Furthermore, HDFS uses fully qualified hostnames which can end up affecting locality matches when allocating containers based on block locations.
- YARN-846.
Major sub-task reported by Jian He and fixed by Jian He
Move pb Impl from yarn-api to yarn-common
- YARN-845.
Major sub-task reported by Arpit Gupta and fixed by Mayank Bansal (resourcemanager)
RM crash with NPE on NODE_UPDATE
The following stack trace is generated in the RM:
{code}
n, service: 68.142.246.147:45454 }, ] resource=<memory:1536, vCores:1> queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:44544, vCores:29>usedCapacity=0.90625, absoluteUsedCapacity=0.90625, numApps=1, numContainers=29 usedCapacity=0.90625 absoluteUsedCapacity=0.90625 used=<memory:44544, vCores:29> cluster=<memory:49152, vCores:48>
2013-06-17 12:43:53,655 INFO capacity.ParentQueue (ParentQueue.java:completedContainer(696)) - completedContainer queue=root usedCapacity=0.90625 absoluteUsedCapacity=0.90625 used=<memory:44544, vCores:29> cluster=<memory:49152, vCores:48>
2013-06-17 12:43:53,656 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(832)) - Application appattempt_1371448527090_0844_000001 released container container_1371448527090_0844_01_000005 on node: host: hostXX:45454 #containers=4 available=2048 used=6144 with event: FINISHED
2013-06-17 12:43:53,656 INFO capacity.CapacityScheduler (CapacityScheduler.java:nodeUpdate(661)) - Trying to fulfill reservation for application application_1371448527090_0844 on node: hostXX:45454
2013-06-17 12:43:53,656 INFO fica.FiCaSchedulerApp (FiCaSchedulerApp.java:unreserve(435)) - Application application_1371448527090_0844 unreserved on node host: hostXX:45454 #containers=4 available=2048 used=6144, currently has 4 at priority 20; currentReservation <memory:6144, vCores:4>
2013-06-17 12:43:53,656 INFO scheduler.AppSchedulingInfo (AppSchedulingInfo.java:updateResourceRequests(168)) - checking for deactivate...
2013-06-17 12:43:53,657 FATAL resourcemanager.ResourceManager (ResourceManager.java:run(422)) - Error in handling event type NODE_UPDATE to the scheduler
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:432)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.unreserve(LeafQueue.java:1416)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1346)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1221)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1180)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignReservedContainer(LeafQueue.java:939)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:803)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:665)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:727)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:83)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:413)
at java.lang.Thread.run(Thread.java:662)
2013-06-17 12:43:53,659 INFO resourcemanager.ResourceManager (ResourceManager.java:run(426)) - Exiting, bbye..
2013-06-17 12:43:53,665 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped SelectChannelConnector@hostXX:8088
2013-06-17 12:43:53,765 ERROR delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:run(513)) - InterruptedExcpetion recieved for ExpiredTokenRemover thread java.lang.InterruptedException: sleep interrupted
2013-06-17 12:43:53,766 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(200)) - Stopping ResourceManager metrics system...
2013-06-17 12:43:53,767 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:stop(206)) - ResourceManager metrics system stopped.
2013-06-17 12:43:53,767 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:shutdown(572)) - ResourceManager metrics system shutdown complete.
2013-06-17 12:43:53,768 WARN amlauncher.ApplicationMasterLauncher (ApplicationMasterLauncher.java:run(98)) - org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher$LauncherThread interrupted. Returning.
2013-06-17 12:43:53,768 INFO ipc.Server (Server.java:stop(2167)) - Stopping server on 8033
2013-06-17 12:43:53,770 INFO ipc.Server (Server.java:run(686)) - Stopping IPC Server listener on 8033
2013-06-17 12:43:53,770 INFO ipc.Server (Server.java:stop(2167)) - Stopping server on 8032
2013-06-17 12:43:53,770 INFO ipc.Server (Server.java:run(828)) - Stopping IPC Server Responder
2013-06-17 12:43:53,771 INFO ipc.Server (Server.java:run(686)) - Stopping IPC Server listener on 8032
2013-06-17 12:43:53,771 INFO ipc.Server (Server.java:run(828)) - Stopping IPC Server Responder
2013-06-17 12:43:53,771 INFO ipc.Server (Server.java:stop(2167)) - Stopping server on 8030
2013-06-17 12:43:53,773 INFO ipc.Server (Server.java:run(686)) - Stopping IPC Server listener on 8030
2013-06-17 12:43:53,773 INFO ipc.Server (Server.java:stop(2167)) - Stopping server on 8031
2013-06-17 12:43:53,773 INFO ipc.Server (Server.java:run(828)) - Stopping IPC Server Responder
2013-06-17 12:43:53,774 INFO ipc.Server (Server.java:run(686)) - Stopping IPC Server listener on 8031
2013-06-17 12:43:53,775 INFO ipc.Server (Server.java:run(828)) - Stopping IPC Server Responder
{code}
- YARN-841.
Major sub-task reported by Siddharth Seth and fixed by Vinod Kumar Vavilapalli
Annotate and document AuxService APIs
For users writing their own AuxServices, these APIs should be annotated and need better documentation. Also, the classes may need to move out of the NodeManager.
- YARN-840.
Major sub-task reported by Jian He and fixed by Jian He
Move ProtoUtils to yarn.api.records.pb.impl
- YARN-839.
Minor bug reported by Chuan Liu and fixed by Chuan Liu
TestContainerLaunch.testContainerEnvVariables fails on Windows
The unit test case fails on Windows because the job id or container id was not printed out as part of the container script. Later, the test tries to read the pid from the output of the file, and fails.
Exception in trunk:
{noformat}
Running org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch
Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 9.903 sec <<< FAILURE!
testContainerEnvVariables(org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch) Time elapsed: 1307 sec <<< ERROR!
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch.testContainerEnvVariables(TestContainerLaunch.java:278)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:62)
{noformat}
- YARN-837.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
ClusterInfo.java doesn't seem to belong to org.apache.hadoop.yarn
- YARN-834.
Blocker sub-task reported by Arun C Murthy and fixed by Zhijie Shen
Review/fix annotations for yarn-client module and clearly differentiate *Async apis
Review/fix annotations for yarn-client module
- YARN-833.
Major bug reported by Zhijie Shen and fixed by Zhijie Shen
Move Graph and VisualizeStateMachine into yarn.state package
Graph and VisualizeStateMachine are only used by the state machine, so they should belong to the state package.
- YARN-831.
Blocker sub-task reported by Jian He and fixed by Jian He
Remove resource min from GetNewApplicationResponse
- YARN-829.
Major bug reported by Zhijie Shen and fixed by Zhijie Shen
Rename RMTokenSelector to be RMDelegationTokenSelector
Therefore, the name of it will be consistent with that of RMDelegationTokenIdentifier.
- YARN-828.
Major bug reported by Zhijie Shen and fixed by Zhijie Shen
Remove YarnVersionAnnotation
YarnVersionAnnotation is not used at all, and the version information can be accessed through YarnVersionInfo instead.
- YARN-827.
Critical sub-task reported by Bikas Saha and fixed by Jian He
Need to make Resource arithmetic methods accessible
org.apache.hadoop.yarn.server.resourcemanager.resource has stuff like Resources and Calculators that help compare/add resources etc. Without these users will be forced to replicate the logic, potentially incorrectly.
- YARN-826.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
Move Clock/SystemClock to util package
Clock/SystemClock should belong to util.
- YARN-825.
Blocker sub-task reported by Vinod Kumar Vavilapalli and fixed by Vinod Kumar Vavilapalli
Fix yarn-common javadoc annotations
- YARN-824.
Major sub-task reported by Jian He and fixed by Jian He
Add static factory to yarn client lib interface and change it to abstract class
Do this for AMRMClient, NMClient, and YarnClient, and annotate their impls as private.
The purpose is to avoid exposing the impl.
- YARN-823.
Major sub-task reported by Jian He and fixed by Jian He
Move RMAdmin from yarn.client to yarn.client.cli and rename as RMAdminCLI
- YARN-822.
Major sub-task reported by Omkar Vinit Joshi and fixed by Omkar Vinit Joshi
Rename ApplicationToken to AMRMToken
API change. At present this token is used on the scheduler API, AMRMProtocol. Right now the name is a little confusing, as it suggests the token might be useful for the application to talk to the complete YARN system (RM/NM), but that is not the case after YARN-694. The NM will have a specific NMToken, so it is better to name it AMRMToken.
- YARN-821.
Major sub-task reported by Jian He and fixed by Jian He
Rename FinishApplicationMasterRequest.setFinishApplicationStatus to setFinalApplicationStatus to be consistent with getter
- YARN-820.
Major sub-task reported by Bikas Saha and fixed by Mayank Bansal
NodeManager has invalid state transition after error in resource localization
- YARN-814.
Major sub-task reported by Hitesh Shah and fixed by Jian He
Difficult to diagnose a failed container launch when error due to invalid environment variable
The container's launch script sets up environment variables, symlinks etc.
If there is any failure when setting up the basic context ( before the actual user's process is launched ), nothing is captured by the NM. This makes it impossible to diagnose the reason for the failure.
To reproduce, set an env var where the value contains characters that throw syntax errors in bash.
- YARN-812.
Major bug reported by Ramya Sunil and fixed by Siddharth Seth
Enabling app summary logs causes 'FileNotFound' errors
RM app summary logs have been enabled as per the default config:
{noformat}
#
# Yarn ResourceManager Application Summary Log
#
# Set the ResourceManager summary log filename
yarn.server.resourcemanager.appsummary.log.file=rm-appsummary.log
# Set the ResourceManager summary log level and appender
yarn.server.resourcemanager.appsummary.logger=INFO,RMSUMMARY
# Appender for ResourceManager Application Summary Log
# Requires the following properties to be set
# - hadoop.log.dir (Hadoop Log directory)
# - yarn.server.resourcemanager.appsummary.log.file (resource manager app summary log filename)
# - yarn.server.resourcemanager.appsummary.logger (resource manager app summary log level and appender)
log4j.logger.org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary=${yarn.server.resourcemanager.appsummary.logger}
log4j.additivity.org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary=false
log4j.appender.RMSUMMARY=org.apache.log4j.RollingFileAppender
log4j.appender.RMSUMMARY.File=${hadoop.log.dir}/${yarn.server.resourcemanager.appsummary.log.file}
log4j.appender.RMSUMMARY.MaxFileSize=256MB
log4j.appender.RMSUMMARY.MaxBackupIndex=20
log4j.appender.RMSUMMARY.layout=org.apache.log4j.PatternLayout
log4j.appender.RMSUMMARY.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
{noformat}
This, however, throws errors when running commands as a non-superuser:
{noformat}
-bash-4.1$ hadoop dfs -ls /
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /var/log/hadoop/hadoopqa/rm-appsummary.log (No such file or directory)
at java.io.FileOutputStream.openAppend(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:192)
at java.io.FileOutputStream.<init>(FileOutputStream.java:116)
at org.apache.log4j.FileAppender.setFile(FileAppender.java:294)
at org.apache.log4j.RollingFileAppender.setFile(RollingFileAppender.java:207)
at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:165)
at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:307)
at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:172)
at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:104)
at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:842)
at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:768)
at org.apache.log4j.PropertyConfigurator.parseCatsAndRenderers(PropertyConfigurator.java:672)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:516)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:580)
at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
at org.apache.log4j.Logger.getLogger(Logger.java:104)
at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:289)
at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:109)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.commons.logging.impl.LogFactoryImpl.createLogFromClass(LogFactoryImpl.java:1116)
at org.apache.commons.logging.impl.LogFactoryImpl.discoverLogImplementation(LogFactoryImpl.java:858)
at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:604)
at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:336)
at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:310)
at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:685)
at org.apache.hadoop.fs.FsShell.<clinit>(FsShell.java:41)
Found 1 items
drwxr-xr-x - hadoop hadoop 0 2013-06-12 21:28 /user
{noformat}
- YARN-806.
Major sub-task reported by Jian He and fixed by Jian He
Move ContainerExitStatus from yarn.api to yarn.api.records
- YARN-805.
Blocker sub-task reported by Jian He and fixed by Jian He
Fix yarn-api javadoc annotations
- YARN-803.
Major improvement reported by Alejandro Abdelnur and fixed by Alejandro Abdelnur (resourcemanager , scheduler)
factor out scheduler config validation from the ResourceManager to each scheduler implementation
Per discussion in YARN-789 we should factor out from the ResourceManager class the scheduler config validations.
- YARN-799.
Major bug reported by Chris Riccomini and fixed by Chris Riccomini (nodemanager)
CgroupsLCEResourcesHandler tries to write to cgroup.procs
The implementation of
bq. ./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java
Tells the container-executor to write PIDs to cgroup.procs:
{code}
public String getResourcesOption(ContainerId containerId) {
  String containerName = containerId.toString();
  StringBuilder sb = new StringBuilder("cgroups=");
  if (isCpuWeightEnabled()) {
    sb.append(pathForCgroup(CONTROLLER_CPU, containerName) + "/cgroup.procs");
    sb.append(",");
  }
  if (sb.charAt(sb.length() - 1) == ',') {
    sb.deleteCharAt(sb.length() - 1);
  }
  return sb.toString();
}
{code}
Apparently, this file has not always been writeable:
https://patchwork.kernel.org/patch/116146/
http://lkml.indiana.edu/hypermail/linux/kernel/1004.1/00536.html
https://lists.linux-foundation.org/pipermail/containers/2009-July/019679.html
The RHEL version of the Linux kernel that I'm using has a CGroup module that has a non-writeable cgroup.procs file.
{quote}
$ uname -a
Linux criccomi-ld 2.6.32-131.4.1.el6.x86_64 #1 SMP Fri Jun 10 10:54:26 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
{quote}
As a result, when the container-executor tries to run, it fails with this error message:
bq. fprintf(LOGFILE, "Failed to write pid %s (%d) to file %s - %s\n",
This is because the executor is given a resource by the CgroupsLCEResourcesHandler that includes cgroup.procs, which is non-writeable:
{quote}
$ pwd
/cgroup/cpu/hadoop-yarn/container_1370986842149_0001_01_000001
$ ls -l
total 0
-r--r--r-- 1 criccomi eng 0 Jun 11 14:43 cgroup.procs
-rw-r--r-- 1 criccomi eng 0 Jun 11 14:43 cpu.rt_period_us
-rw-r--r-- 1 criccomi eng 0 Jun 11 14:43 cpu.rt_runtime_us
-rw-r--r-- 1 criccomi eng 0 Jun 11 14:43 cpu.shares
-rw-r--r-- 1 criccomi eng 0 Jun 11 14:43 notify_on_release
-rw-r--r-- 1 criccomi eng 0 Jun 11 14:43 tasks
{quote}
I patched CgroupsLCEResourcesHandler to use /tasks instead of /cgroup.procs, and this appears to have fixed the problem.
I can think of several potential resolutions to this ticket:
1. Ignore the problem, and make people patch YARN when they hit this issue.
2. Write to /tasks instead of /cgroup.procs for everyone.
3. Check permissions on /cgroup.procs prior to writing to it, and fall back to /tasks.
4. Add a config to yarn-site that lets admins specify which file to write to.
Thoughts?
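Option 3 above could be sketched roughly as follows. This is only an illustration, not the change that was committed for this issue: the class and method names (CgroupPidFileChooser, choosePidFile) are hypothetical, and it assumes that java.nio.file.Files.isWritable reflects the permissions the container-executor would actually see.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Predicate;

// Hypothetical helper, not part of YARN: chooses the cgroup file to write PIDs to.
class CgroupPidFileChooser {

  // Convenience entry point: asks the filesystem whether cgroup.procs is writable.
  static Path choosePidFile(Path cgroupDir) {
    return choosePidFile(cgroupDir, Files::isWritable);
  }

  // Testable variant: the writability check is passed in as a predicate.
  static Path choosePidFile(Path cgroupDir, Predicate<Path> isWritable) {
    Path procs = cgroupDir.resolve("cgroup.procs");
    // Prefer cgroup.procs when it can be written; otherwise fall back to tasks.
    return isWritable.test(procs) ? procs : cgroupDir.resolve("tasks");
  }
}
```

Passing the check in as a predicate keeps the fallback logic testable without a real cgroup mount.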
- YARN-795.
Major bug reported by Wei Yan and fixed by Wei Yan (scheduler)
Fair scheduler queue metrics should subtract allocated vCores from available vCores
The fair scheduler's queue metrics don't subtract allocated vCores from available vCores, so the available vCores value returned is incorrect.
This is happening because {code}QueueMetrics.getAllocateResources(){code} doesn't return the allocated vCores.
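The intended bookkeeping can be illustrated with a minimal sketch. The class below is hypothetical and does not reproduce the real QueueMetrics API; it only shows that an allocation has to move both memory and vCores from "available" to "allocated".

```java
// Hypothetical bookkeeping class; the real QueueMetrics API is not reproduced here.
class QueueResourceSketch {
  private int availableMB;
  private int availableVCores;
  private int allocatedMB;
  private int allocatedVCores;

  QueueResourceSketch(int totalMB, int totalVCores) {
    this.availableMB = totalMB;
    this.availableVCores = totalVCores;
  }

  // Every allocation moves both dimensions from "available" to "allocated".
  void allocate(int mb, int vcores) {
    allocatedMB += mb;
    allocatedVCores += vcores;
    availableMB -= mb;
    availableVCores -= vcores;
  }

  int getAvailableMB() { return availableMB; }
  int getAvailableVCores() { return availableVCores; }
  int getAllocatedMB() { return allocatedMB; }
  int getAllocatedVCores() { return allocatedVCores; }
}
```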
- YARN-792.
Major sub-task reported by Jian He and fixed by Jian He
Move NodeHealthStatus from yarn.api.record to yarn.server.api.record
- YARN-791.
Blocker sub-task reported by Sandy Ryza and fixed by Sandy Ryza (api , resourcemanager)
Ensure that RM RPC APIs that return nodes are consistent with /nodes REST API
- YARN-789.
Major improvement reported by Alejandro Abdelnur and fixed by Alejandro Abdelnur (scheduler)
Enable zero capabilities resource requests in fair scheduler
Per discussion in YARN-689, reposting updated use case:
1. I have a set of services co-existing with a Yarn cluster.
2. These services run out of band from Yarn. They are not started as yarn containers and they don't use Yarn containers for processing.
3. These services use, dynamically, different amounts of CPU and memory based on their load. They manage their CPU and memory requirements independently. In other words, depending on their load, they may require more CPU but not memory or vice-versa.
By using YARN as RM for these services I'm able to share and utilize the resources of the cluster appropriately and in a dynamic way. Yarn keeps tabs on all the resources.
These services run an AM that reserves resources on their behalf. When this AM gets the requested resources, the services bump up their CPU/memory utilization out of band from Yarn. If the Yarn allocations are released/preempted, the services back off on their resource utilization. By doing this, Yarn and these services correctly share the cluster resources, with the Yarn RM being the only one that does the overall resource bookkeeping.
So as not to break the lifecycle of containers, the services' AM starts containers in the corresponding NMs. These container processes basically sleep forever (i.e. sleep 10000d). They use almost no CPU or memory (less than 1MB). Thus it is reasonable to assume their required CPU and memory utilization is NIL (more on hard enforcement later). Because of this almost NIL utilization of CPU and memory, it is possible to specify, when doing a request, zero as one of the dimensions (CPU or memory).
The current limitation is that the increment is also the minimum.
If we set the memory increment to 1MB, then when doing a pure CPU request we would have to specify 1MB of memory. That would work. However, it would allow discretionary memory requests without the desired normalization (increments of 256, 512, etc).
If we set the CPU increment to 1 CPU, then when doing a pure memory request we would have to specify 1 CPU. CPU amounts are much smaller than memory amounts, and because we don't have fractional CPUs, it would mean that all my pure memory requests would waste 1 CPU, thus reducing the overall utilization of the cluster.
Finally, on hard enforcement.
* For CPU. Hard enforcement can be done via a cgroup cpu controller. Using an absolute minimum of a few CPU shares (i.e. 10) in the LinuxContainerExecutor, we ensure there are enough CPU cycles to run the sleep process. This absolute minimum would only kick in if zero is allowed; otherwise it will never kick in, as the shares for 1 CPU are 1024.
* For Memory. Hard enforcement is currently done by ProcfsBasedProcessTree.java; using an absolute minimum of 1 or 2 MBs would take care of zero memory resources. Again, this absolute minimum would only kick in if zero is allowed; otherwise it will never kick in, as the memory increment is several MBs, if not 1GB.
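The CPU floor described above can be written as a one-line mapping. This is a sketch under the numbers quoted in the text (1024 shares per CPU, a floor of 10 shares); the class and method names are hypothetical and not part of the LinuxContainerExecutor.

```java
// Hypothetical helper illustrating the "absolute minimum" idea; not YARN code.
class CpuSharesSketch {
  static final int SHARES_PER_VCORE = 1024; // shares for 1 CPU, per the text
  static final int MIN_SHARES = 10;         // example floor ("a few CPU shares")

  // Maps a container's vcore request to cgroup cpu.shares, never going below the floor.
  static int cpuShares(int vcores) {
    if (vcores < 0) {
      throw new IllegalArgumentException("vcores must be >= 0");
    }
    return Math.max(MIN_SHARES, vcores * SHARES_PER_VCORE);
  }
}
```

With these values the floor only matters for a zero-vcore request; any request of 1 vcore or more already exceeds it.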
- YARN-787.
Blocker sub-task reported by Alejandro Abdelnur and fixed by Alejandro Abdelnur (api)
Remove resource min from Yarn client API
Per discussions in YARN-689 and YARN-769 we should remove minimum from the API as this is a scheduler internal thing.
- YARN-782.
Critical improvement reported by Sandy Ryza and fixed by Sandy Ryza (nodemanager)
vcores-pcores ratio functions differently from vmem-pmem ratio in misleading way
The vcores-pcores ratio functions differently from the vmem-pmem ratio in the sense that the vcores-pcores ratio has an impact on allocations and the vmem-pmem ratio does not.
If I double my vmem-pmem ratio, the only change that occurs is that my containers, after being scheduled, are less likely to be killed for using too much virtual memory. But if I double my vcore-pcore ratio, my nodes will appear to the ResourceManager to contain double the amount of CPU space, which will affect scheduling decisions.
The lack of consistency will exacerbate the already difficult problem of resource configuration.
- YARN-781.
Major sub-task reported by Devaraj Das and fixed by Jian He
Expose LOGDIR that containers should use for logging
The LOGDIR is known. We should expose this to the container's environment.
- YARN-777.
Major sub-task reported by Jian He and fixed by Jian He
Remove unreferenced objects from proto
- YARN-773.
Major sub-task reported by Jian He and fixed by Jian He
Move YarnRuntimeException from package api.yarn to api.yarn.exceptions
- YARN-767.
Major bug reported by Jian He and fixed by Jian He
Initialize Application status metrics when QueueMetrics is initialized
Applications: ResourceManager.QueueMetrics.AppsSubmitted, ResourceManager.QueueMetrics.AppsRunning, ResourceManager.QueueMetrics.AppsPending, ResourceManager.QueueMetrics.AppsCompleted, ResourceManager.QueueMetrics.AppsKilled, ResourceManager.QueueMetrics.AppsFailed
For now these metrics are created only when they are needed; we want them to be visible as soon as QueueMetrics is initialized.
- YARN-764.
Major bug reported by nemon lou and fixed by nemon lou (resourcemanager)
blank Used Resources on Capacity Scheduler page
Even when there are jobs running, used resources is empty on the Capacity Scheduler page for a leaf queue. (I use google-chrome on Windows 7.)
After changing Resource.java's toString method by replacing "<>" with "{}", this bug gets fixed.
- YARN-763.
Major bug reported by Bikas Saha and fixed by Xuan Gong
AMRMClientAsync should stop heartbeating after receiving shutdown from RM
- YARN-761.
Major bug reported by Vinod Kumar Vavilapalli and fixed by Zhijie Shen
TestNMClientAsync fails sometimes
See https://builds.apache.org/job/PreCommit-YARN-Build/1101//testReport/.
It passed on my machine though.
- YARN-760.
Major bug reported by Sandy Ryza and fixed by Niranjan Singh (nodemanager)
NodeManager throws AvroRuntimeException on failed start
NodeManager wraps exceptions that occur in its start method in AvroRuntimeExceptions, even though it doesn't use Avro anywhere else.
- YARN-759.
Major sub-task reported by Bikas Saha and fixed by Bikas Saha
Create Command enum in AllocateResponse
Use command enums for shutdown/resync instead of booleans.
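A minimal sketch of the idea, with hypothetical names (the enum and its constants are not taken from the actual AllocateResponse API): a single enum field cannot express the contradictory state that two independent booleans allow, such as "shutdown" and "resync" both being true.

```java
// Hypothetical sketch; names are illustrative, not the real YARN record types.
class AllocateCommandSketch {
  enum Command { NONE, RESYNC, SHUTDOWN }

  // The AM reacts to exactly one command per response.
  static String react(Command command) {
    switch (command) {
      case RESYNC:
        return "resync";
      case SHUTDOWN:
        return "shutdown";
      default:
        return "continue";
    }
  }
}
```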
- YARN-757.
Blocker bug reported by Bikas Saha and fixed by Bikas Saha
TestRMRestart failing/stuck on trunk
- YARN-756.
Major sub-task reported by Jian He and fixed by Jian He
Move PreemptionContainer/PreemptionContract/PreemptionMessage/StrictPreemptionContract/PreemptionResourceRequest to api.records
- YARN-755.
Major sub-task reported by Bikas Saha and fixed by Bikas Saha
Rename AllocateResponse.reboot to AllocateResponse.resync
For work-preserving RM restart, the AMs will be resyncing instead of rebooting. Rebooting is an action that currently satisfies the resync requirement. Changing the name now so that it continues to make sense in the real resync case.
- YARN-753.
Major sub-task reported by Jian He and fixed by Jian He
Add individual factory method for api protocol records
- YARN-752.
Major improvement reported by Sandy Ryza and fixed by Sandy Ryza (api , applications)
In AMRMClient, automatically add corresponding rack requests for requested nodes
A ContainerRequest that includes node-level requests must also include matching rack-level requests for the racks that those nodes are on. When a node is present without its rack, it makes sense for the client to automatically add the node's rack.
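The behaviour described above might look roughly like this. It is a sketch, not the AMRMClient implementation: the class name, the method name and the rackOf resolver are hypothetical stand-ins for whatever topology lookup the client uses.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of "add the rack of every requested node"; not AMRMClient code.
class RackFillSketch {
  // Returns the explicitly requested racks plus the rack of each requested node.
  static Set<String> withNodeRacks(List<String> nodes, List<String> racks,
                                   Function<String, String> rackOf) {
    Set<String> result = new LinkedHashSet<>(racks);
    for (String node : nodes) {
      // A set makes the addition a no-op when the rack was already requested.
      result.add(rackOf.apply(node));
    }
    return result;
  }
}
```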
- YARN-750.
Major sub-task reported by Arun C Murthy and fixed by Arun C Murthy
Allow for black-listing resources in YARN API and Impl in CS
YARN-392 and YARN-398 enhance scheduler api to allow for white-lists of resources.
This jira is a companion to allow for black-listing (in CS).
- YARN-749.
Major sub-task reported by Arun C Murthy and fixed by Arun C Murthy
Rename ResourceRequest (get,set)HostName to (get,set)ResourceName
We should rename ResourceRequest (get,set)HostName to (get,set)ResourceName since the name can be host, rack or *.
- YARN-748.
Major sub-task reported by Jian He and fixed by Jian He
Move BuilderUtils from yarn-common to yarn-server-common
- YARN-746.
Major sub-task reported by Steve Loughran and fixed by Steve Loughran
rename Service.register() and Service.unregister() to registerServiceListener() & unregisterServiceListener() respectively
Make it clear what you are registering on a {{Service}} by naming the methods {{registerServiceListener()}} & {{unregisterServiceListener()}} respectively.
This only affects a couple of production classes; {{Service.register()}} is also used in some of the lifecycle tests of YARN-530. There are no tests of {{Service.unregister()}}, which is something that could be corrected.
- YARN-742.
Major bug reported by Kihwal Lee and fixed by Jason Lowe (nodemanager)
Log aggregation causes a lot of redundant setPermission calls
In one of our clusters, namenode RPC is spending 45% of its time on serving setPermission calls. Further investigation has revealed that most calls are redundantly made on /mapred/logs/<user>/logs. Also mkdirs calls are made before this.
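One way to avoid such redundant calls is to remember which directories have already been handled. The sketch below is an assumption about a possible approach, not the fix that was committed (which may instead check the existing permissions first); PermissionOnceSketch and its Consumer argument are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical wrapper, not YARN code: issues the expensive call at most once per directory.
class PermissionOnceSketch {
  private final Set<String> alreadySet = new HashSet<>();
  private final Consumer<String> setPermission; // stands in for the real filesystem call

  PermissionOnceSketch(Consumer<String> setPermission) {
    this.setPermission = setPermission;
  }

  // Calls setPermission only the first time a directory is seen.
  synchronized void ensurePermission(String dir) {
    if (alreadySet.add(dir)) {
      setPermission.accept(dir);
    }
  }
}
```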
- YARN-739.
Major sub-task reported by Siddharth Seth and fixed by Omkar Vinit Joshi
NM startContainer should validate the NodeId
The NM validates certain fields from the ContainerToken on a startContainer call. It should also validate the NodeId (which needs to be added to the ContainerToken).
- YARN-737.
Major sub-task reported by Jian He and fixed by Jian He
Some Exceptions no longer need to be wrapped by YarnException and can be directly thrown out after YARN-142
- YARN-736.
Major improvement reported by Sandy Ryza and fixed by Sandy Ryza (scheduler)
Add a multi-resource fair sharing metric
Currently, at a regular interval, the fair scheduler computes a fair memory share for each queue and application inside it. This fair share is not used for scheduling decisions, but is displayed in the web UI, exposed as a metric, and used for preemption decisions.
With DRF and multi-resource scheduling, assigning a memory share as the fair share metric to every queue no longer makes sense. It's not obvious what the replacement should be, but probably something like fractional fairness within a queue, or distance from an ideal cluster state.
- YARN-735.
Major sub-task reported by Jian He and fixed by Jian He
Make ApplicationAttemptID, ContainerID, NodeID immutable
- YARN-733.
Major bug reported by Zhijie Shen and fixed by Zhijie Shen
TestNMClient fails occasionally
The problem happens at:
{code}
// getContainerStatus can be called after stopContainer
try {
  ContainerStatus status = nmClient.getContainerStatus(
      container.getId(), container.getNodeId(),
      container.getContainerToken());
  assertEquals(container.getId(), status.getContainerId());
  assertEquals(ContainerState.RUNNING, status.getState());
  assertTrue("" + i, status.getDiagnostics().contains(
      "Container killed by the ApplicationMaster."));
  assertEquals(-1000, status.getExitStatus());
} catch (YarnRemoteException e) {
  fail("Exception is not expected");
}
{code}
NMClientImpl#stopContainer returns, but the container hasn't necessarily been stopped yet. ContainerManagerImpl implements stopContainer asynchronously, so the container's status is in transition. Calling NMClientImpl#getContainerStatus immediately after stopContainer will get either the RUNNING status or the COMPLETE one.
There will be a similar problem with NMClientImpl#startContainer.
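A test that has to cope with this kind of asynchronous transition can poll for the expected state instead of asserting it immediately. The helper below is a generic sketch with hypothetical names; it is not the change made to TestNMClient, and the attempt count and sleep interval are arbitrary.

```java
import java.util.function.Supplier;

// Hypothetical polling helper; not taken from the YARN test code.
class PollSketch {
  // Polls the supplier until it returns `expected` or the attempts run out.
  static <T> boolean waitFor(Supplier<T> current, T expected,
                             int maxAttempts, long sleepMillis) {
    for (int i = 0; i < maxAttempts; i++) {
      if (expected.equals(current.get())) {
        return true;
      }
      try {
        Thread.sleep(sleepMillis);
      } catch (InterruptedException e) {
        // Restore the interrupt flag and give up early.
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return false;
  }
}
```

In a test, the supplier would wrap the status query and the expected value would be the terminal state.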
- YARN-731.
Major sub-task reported by Siddharth Seth and fixed by Zhijie Shen
RPCUtil.unwrapAndThrowException should unwrap remote RuntimeExceptions
Will be required for YARN-662. Also, remote NPEs show up incorrectly for some unit tests.
- YARN-727.
Blocker sub-task reported by Siddharth Seth and fixed by Xuan Gong
ClientRMProtocol.getAllApplications should accept ApplicationType as a parameter
Now that an ApplicationType is registered on ApplicationSubmission, getAllApplications should be able to use this string to query for a specific application type.
- YARN-726.
Critical bug reported by Siddharth Seth and fixed by Mayank Bansal
Queue, FinishTime fields broken on RM UI
The queue shows up as "Invalid Date"
Finish Time shows up as a Long value.
- YARN-724.
Major sub-task reported by Jian He and fixed by Jian He
Move ProtoBase from api.records to api.records.impl.pb
Simply move ProtoBase to records.impl.pb
- YARN-720.
Major sub-task reported by Siddharth Seth and fixed by Zhijie Shen
container-log4j.properties should not refer to mapreduce properties
This refers to yarn.app.mapreduce.container.log.dir and yarn.app.mapreduce.container.log.filesize. These should either be moved into the MR codebase, or the parameters should be renamed.
- YARN-719.
Major sub-task reported by Vinod Kumar Vavilapalli and fixed by Vinod Kumar Vavilapalli
Move RMIdentifier from Container to ContainerTokenIdentifier
This needs to be done for YARN-684 to happen.
- YARN-717.
Major sub-task reported by Jian He and fixed by Jian He
Copy BuilderUtil methods into token-related records
This is separated from YARN-711, because after changing yarn.api.token from an interface to an abstract class, e.g. ClientTokenPBImpl would have to extend two classes, both TokenPBImpl and the ClientToken abstract class, which is not allowed in Java.
We may remove the ClientToken/ContainerToken/DelegationToken interface and just use the common Token interface.
- YARN-716.
Major task reported by Siddharth Seth and fixed by Siddharth Seth
Make ApplicationID immutable
- YARN-715.
Major bug reported by Siddharth Seth and fixed by Vinod Kumar Vavilapalli
TestDistributedShell and TestUnmanagedAMLauncher are failing
Tests are timing out. Looks like this is related to YARN-617.
{code}
2013-05-21 17:40:23,693 ERROR [IPC Server handler 0 on 54024] containermanager.ContainerManagerImpl (ContainerManagerImpl.java:authorizeRequest(412)) - Unauthorized request to start container.
Expected containerId: user Found: container_1369183214008_0001_01_000001
2013-05-21 17:40:23,694 ERROR [IPC Server handler 0 on 54024] security.UserGroupInformation (UserGroupInformation.java:doAs(1492)) - PriviledgedActionException as:user (auth:SIMPLE) cause:org.apache.hado
Expected containerId: user Found: container_1369183214008_0001_01_000001
2013-05-21 17:40:23,695 INFO [IPC Server handler 0 on 54024] ipc.Server (Server.java:run(1864)) - IPC Server handler 0 on 54024, call org.apache.hadoop.yarn.api.ContainerManagerPB.startContainer from 10.
Expected containerId: user Found: container_1369183214008_0001_01_000001
org.apache.hadoop.yarn.exceptions.YarnRemoteException: Unauthorized request to start container.
Expected containerId: user Found: container_1369183214008_0001_01_000001
at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:43)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.authorizeRequest(ContainerManagerImpl.java:413)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainer(ContainerManagerImpl.java:440)
at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagerPBServiceImpl.startContainer(ContainerManagerPBServiceImpl.java:72)
at org.apache.hadoop.yarn.proto.ContainerManager$ContainerManagerService$2.callBlockingMethod(ContainerManager.java:83)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527)
{code}
- YARN-714.
Major sub-task reported by Omkar Vinit Joshi and fixed by Omkar Vinit Joshi
AMRM protocol changes for sending NMToken list
An NMToken will be sent to the AM on an allocate call if:
1) AM doesn't already have NMToken for the underlying NM
2) Key rolled over on RM and AM gets new container on the same NM.
On allocate call RM will send a consolidated list of all required NMTokens.
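The two conditions above can be sketched as a small piece of per-AM bookkeeping on the RM side. Everything here is hypothetical (class name, integer key ids, string node ids); it only illustrates the decision, not the actual RM data structures.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-AM bookkeeping; not the RM's real implementation.
class NMTokenIssueSketch {
  // For each NM, the id of the master key used for the token last sent to this AM.
  private final Map<String, Integer> issuedKeyIdByNode = new HashMap<>();
  private int currentKeyId;

  NMTokenIssueSketch(int currentKeyId) {
    this.currentKeyId = currentKeyId;
  }

  void rollMasterKey() {
    currentKeyId++;
  }

  // Returns true if a fresh NMToken should accompany a container allocated on nodeId.
  boolean needsToken(String nodeId) {
    Integer issued = issuedKeyIdByNode.get(nodeId);
    if (issued != null && issued == currentKeyId) {
      return false; // AM already holds a token made with the current key
    }
    issuedKeyIdByNode.put(nodeId, currentKeyId);
    return true;
  }
}
```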
- YARN-711.
Major sub-task reported by Vinod Kumar Vavilapalli and fixed by Jian He
Copy BuilderUtil methods into individual records
BuilderUtils is one giant utils class which has all the factory methods needed for creating records. It is painful for users to figure out how to create records. We are better off having the factories in each record, that way users can easily create records.
As a first step, we should just copy all the factory methods into individual classes, deprecate BuilderUtils and then slowly move all code off BuilderUtils.
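The pattern being proposed is a static factory on the record type itself. The record below is a made-up stand-in (ResourceRecordSketch is not a YARN class) used only to show the shape of such a factory.

```java
// Hypothetical record type with its own factory method; not a real YARN record.
abstract class ResourceRecordSketch {
  abstract int getMemory();
  abstract int getVirtualCores();

  // Users create the record here instead of hunting through a shared utils class.
  static ResourceRecordSketch newInstance(final int memory, final int vcores) {
    return new ResourceRecordSketch() {
      @Override
      int getMemory() { return memory; }

      @Override
      int getVirtualCores() { return vcores; }
    };
  }
}
```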
- YARN-708.
Major task reported by Siddharth Seth and fixed by Siddharth Seth
Move RecordFactory classes to hadoop-yarn-api, miscellaneous fixes to the interfaces
This is required for additional changes in YARN-528.
Some of the interfaces could use some cleanup as well - they shouldn't be declaring YarnException (Runtime) in their signature.
- YARN-706.
Major bug reported by Zhijie Shen and fixed by Zhijie Shen
Race Condition in TestFSDownload
See the test failure in YARN-695
https://builds.apache.org/job/PreCommit-YARN-Build/957//testReport/org.apache.hadoop.yarn.util/TestFSDownload/testDownloadPatternJar/
- YARN-701.
Blocker sub-task reported by Vinod Kumar Vavilapalli and fixed by Vinod Kumar Vavilapalli
ApplicationTokens should be used irrespective of kerberos
- Single code path for secure and non-secure cases is useful for testing, coverage.
- Having this in non-secure mode will help us avoid accidental bugs in AMs DDoS'ing and bringing down RM.
- YARN-700.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic
TestInfoBlock fails on Windows because of line ending mismatch
Exception:
{noformat}
Running org.apache.hadoop.yarn.webapp.view.TestInfoBlock
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.962 sec <<< FAILURE!
testMultilineInfoBlock(org.apache.hadoop.yarn.webapp.view.TestInfoBlock) Time elapsed: 873 sec <<< FAILURE!
java.lang.AssertionError:
at org.junit.Assert.fail(Assert.java:91)
at org.junit.Assert.assertTrue(Assert.java:43)
at org.junit.Assert.assertTrue(Assert.java:54)
at org.apache.hadoop.yarn.webapp.view.TestInfoBlock.testMultilineInfoBlock(TestInfoBlock.java:79)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.FailOnTimeout$1.run(FailOnTimeout.java:28)
{noformat}
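A common way to make such a comparison platform-independent is to normalize line endings before asserting. The helper below is a generic sketch with a hypothetical name; it is not necessarily how TestInfoBlock was fixed.

```java
// Hypothetical helper; shows line-ending normalization before a string comparison.
class LineEndingSketch {
  // Converts Windows line endings (CRLF) to Unix ones (LF).
  static String normalize(String text) {
    return text.replace("\r\n", "\n");
  }
}
```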
- YARN-695.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
masterContainer and status are in ApplicationReportProto but not in ApplicationReport
If masterContainer and status are no longer part of ApplicationReport, they should be removed from proto as well.
- YARN-694.
Major bug reported by Omkar Vinit Joshi and fixed by Omkar Vinit Joshi
Start using NMTokens to authenticate all communication with NM
AM uses the NMToken to authenticate all the AM-NM communication.
NM will validate the NMToken in the following manner:
* If NMToken is using current or previous master key then the NMToken is valid. In this case it will update its cache with this key corresponding to appId.
* If NMToken is using the master key which is present in NM's cache corresponding to AM's appId then it will be validated based on this.
* If NMToken is invalid then NM will reject AM calls.
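The three validation rules above can be sketched as follows. The class, the integer key ids and the string application ids are all hypothetical simplifications; real tokens carry more than a key id, and this is not the NM's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the NM-side check; not the real NMToken secret manager.
class NMTokenValidationSketch {
  private int currentKeyId;
  private int previousKeyId;
  // Key id last accepted for each application.
  private final Map<String, Integer> keyIdByApp = new HashMap<>();

  NMTokenValidationSketch(int currentKeyId, int previousKeyId) {
    this.currentKeyId = currentKeyId;
    this.previousKeyId = previousKeyId;
  }

  void rollMasterKey(int newKeyId) {
    previousKeyId = currentKeyId;
    currentKeyId = newKeyId;
  }

  boolean isValid(String appId, int tokenKeyId) {
    // Rule 1: current or previous master key is valid; remember it for this app.
    if (tokenKeyId == currentKeyId || tokenKeyId == previousKeyId) {
      keyIdByApp.put(appId, tokenKeyId);
      return true;
    }
    // Rule 2: otherwise accept only the key cached for this app.
    Integer cached = keyIdByApp.get(appId);
    // Rule 3: anything else is rejected.
    return cached != null && cached == tokenKeyId;
  }
}
```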
Modification for ContainerToken
* At present RPC validates AM-NM communication based on ContainerToken. It will be replaced with NMToken. Also, from now on the AM will use one NMToken per NM (replacing the earlier behavior of one ContainerToken per container per NM).
* startContainer in a secured environment currently uses the ContainerToken from the UGI (YARN-617); after this change it will use the one from the payload (Container).
* ContainerToken will still exist, but it will only be used to validate the AM's container start request.
- YARN-693.
Major bug reported by Omkar Vinit Joshi and fixed by Omkar Vinit Joshi
Sending NMToken to AM on allocate call
This is part of YARN-613.
As per the updated design, the AM will receive an NMToken per NM in the following scenarios:
* AM is receiving first container on underlying NM.
* AM is receiving container on underlying NM after either NM or RM rebooted.
** After RM reboot, as RM doesn't remember (persist) the information about keys issued per AM per NM, it will reissue tokens in case AM gets new container on underlying NM. However on NM side NM will still retain older token until it receives new token to support long running jobs (in work preserving environment).
** After NM reboot, RM will delete the token information corresponding to that NM for all AMs.
* AM is receiving container on underlying NM after NMToken master key is rolled over on RM side.
In all these cases, if the AM receives a new NMToken then it is supposed to store it for future NM communication until it receives a new one.
AMRMClient should expose these NMTokens to the client.
- YARN-692.
Major bug reported by Omkar Vinit Joshi and fixed by Omkar Vinit Joshi
Creating NMToken master key on RM and sharing it with NM as a part of RM-NM heartbeat.
This is related to YARN-613. Here we will be implementing NMToken generation on the RM side and sharing it with the NM during the RM-NM heartbeat. As a part of this JIRA, the master key will only be made available to the NM, but there will be no validation done until AM-NM communication is fixed.
- YARN-690.
Blocker bug reported by Daryn Sharp and fixed by Daryn Sharp (resourcemanager)
RM exits on token cancel/renew problems
The DelegationTokenRenewer thread is critical to the RM. When a non-IOException occurs, the thread calls System.exit to prevent the RM from running without the thread. It should be exiting only on non-RuntimeExceptions.
The problem is especially bad in 23 because the yarn protobuf layer converts IOExceptions into UndeclaredThrowableExceptions (RuntimeException) which causes the renewer to abort the process. An UnknownHostException takes down the RM...
- YARN-688.
Major bug reported by Jian He and fixed by Jian He
Containers not cleaned up when NM received SHUTDOWN event from NodeStatusUpdater
Currently, both the SHUTDOWN event from NodeStatusUpdater and the CleanupContainers event are handled on the same dispatcher thread, so the CleanupContainers event will not be processed until the SHUTDOWN event is processed. See the similar problem in YARN-495.
On normal NM shutdown, this is not a problem since normal stop happens on shutdownHook thread.
- YARN-686.
Major sub-task reported by Sandy Ryza and fixed by Sandy Ryza (api)
Flatten NodeReport
The NodeReport returned by getClusterNodes or given to AMs in heartbeat responses includes both a NodeState (enum) and a NodeHealthStatus (object). As UNHEALTHY is already a NodeState, a separate NodeHealthStatus doesn't seem necessary. I propose eliminating NodeHealthStatus#getIsNodeHealthy and moving its two other methods, getHealthReport and getLastHealthReportTime, into NodeReport.
- YARN-684.
Major sub-task reported by Vinod Kumar Vavilapalli and fixed by Vinod Kumar Vavilapalli
ContainerManager.startContainer needs to only have ContainerTokenIdentifier instead of the whole Container
The NM only needs the token, the whole Container is unnecessary.
- YARN-663.
Major sub-task reported by Xuan Gong and fixed by Xuan Gong
Change ResourceTracker API and LocalizationProtocol API to throw YarnRemoteException and IOException
- YARN-661.
Major bug reported by Jason Lowe and fixed by Omkar Vinit Joshi (nodemanager)
NM fails to cleanup local directories for users
YARN-71 added deletion of local directories on startup, but in practice it fails to delete the directories because of permission problems. The top-level usercache directory is owned by the user but is in a directory that is not writable by the user. Therefore the deletion of the user's usercache directory, as the user, fails due to lack of permissions.
- YARN-660.
Major sub-task reported by Bikas Saha and fixed by Bikas Saha
Improve AMRMClient with matching requests
- YARN-655.
Major bug reported by Sandy Ryza and fixed by Sandy Ryza (scheduler)
Fair scheduler metrics should subtract allocated memory from available memory
In the scheduler web UI, cluster metrics reports that the "Memory Total" goes up when an application is allocated resources.
- YARN-654.
Major bug reported by Bikas Saha and fixed by Xuan Gong
AMRMClient: Perform sanity checks for parameters of public methods
- YARN-651.
Major sub-task reported by Xuan Gong and fixed by Xuan Gong
Change ContainerManagerPBClientImpl and RMAdminProtocolPBClientImpl to throw IOException and YarnRemoteException
YARN-632 and YARN-633 change the RMAdmin and ContainerManager APIs to throw YarnRemoteException and IOException. RMAdminProtocolPBClientImpl and ContainerManagerPBClientImpl should make the same changes.
- YARN-648.
Major bug reported by Karthik Kambatla and fixed by Karthik Kambatla (scheduler)
FS: Add documentation for pluggable policy
YARN-469 and YARN-482 make the scheduling policy in FS pluggable. Need to add documentation on how to use this.
- YARN-646.
Major bug reported by Dapeng Sun and fixed by Dapeng Sun (documentation)
Some issues in Fair Scheduler's document
Issues are found in the doc page for Fair Scheduler http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html:
1. The section “Configuration” contains two properties named “yarn.scheduler.fair.minimum-allocation-mb”; the second one should be “yarn.scheduler.fair.maximum-allocation-mb”.
2. In the section “Allocation file format”, the document says “ The format contains three types of elements”, but it lists four types of elements after that.
- YARN-645.
Major bug reported by Jian He and fixed by Jian He
Move RMDelegationTokenSecretManager from yarn-server-common to yarn-server-resourcemanager
RMDelegationTokenSecretManager is specific to the resource manager and should not belong to server-common.
- YARN-642.
Major bug reported by Sandy Ryza and fixed by Sandy Ryza (api , resourcemanager)
Fix up /nodes REST API to have 1 param and be consistent with the Java API
The code behind the /nodes RM REST API is unnecessarily muddled, logs the same misspelled INFO message repeatedly, and does not return unhealthy nodes, even when asked.
- YARN-639.
Major bug reported by Zhijie Shen and fixed by Zhijie Shen (applications/distributed-shell)
Make AM of Distributed Shell Use NMClient
YARN-422 adds NMClient. AM of Distributed Shell should use it instead of using ContainerManager directly.
- YARN-638.
Major sub-task reported by Jian He and fixed by Jian He (resourcemanager)
Restore RMDelegationTokens after RM Restart
This was missed in YARN-581. After RM restart, RMDelegationTokens need to be added both in DelegationTokenRenewer (addressed in YARN-581) and in delegationTokenSecretManager.
- YARN-637.
Major bug reported by Karthik Kambatla and fixed by Karthik Kambatla (scheduler)
FS: maxAssign is not honored
maxAssign limits the number of containers that can be assigned in a single heartbeat. Currently, FS doesn't keep track of number of assigned containers to check this.
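The missing bookkeeping amounts to a counter that is checked on every assignment within a heartbeat. The sketch below is a simplification with hypothetical names; treating a non-positive maxAssign as "no limit" is an assumption made for the example, not something stated above.

```java
// Hypothetical sketch of per-heartbeat assignment with a cap; not FairScheduler code.
class MaxAssignSketch {
  // Returns how many containers get assigned in one heartbeat.
  // Assumption: a non-positive maxAssign means "no limit".
  static int assignOnHeartbeat(int pendingRequests, int freeSlots, int maxAssign) {
    int assigned = 0;
    while (assigned < pendingRequests && assigned < freeSlots) {
      if (maxAssign > 0 && assigned >= maxAssign) {
        break; // cap reached for this heartbeat
      }
      assigned++;
    }
    return assigned;
  }
}
```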
- YARN-635.
Major sub-task reported by Xuan Gong and fixed by Siddharth Seth
Rename YarnRemoteException to YarnException
- YARN-634.
Major sub-task reported by Siddharth Seth and fixed by Siddharth Seth
Make YarnRemoteException not backed by PB and introduce a SerializedException
LocalizationProtocol sends an exception over the wire. This currently uses YarnRemoteException. Post YARN-627, this needs to be changed and a new serialized exception is required.
- YARN-633.
Major sub-task reported by Xuan Gong and fixed by Xuan Gong
Change RMAdminProtocol api to throw IOException and YarnRemoteException
- YARN-632.
Major sub-task reported by Xuan Gong and fixed by Xuan Gong
Change ContainerManager api to throw IOException and YarnRemoteException
- YARN-631.
Major sub-task reported by Xuan Gong and fixed by Xuan Gong
Change ClientRMProtocol api to throw IOException and YarnRemoteException
- YARN-630.
Major sub-task reported by Xuan Gong and fixed by Xuan Gong
Change AMRMProtocol api to throw IOException and YarnRemoteException
- YARN-629.
Major sub-task reported by Xuan Gong and fixed by Xuan Gong
Make YarnRemoteException not be rooted at IOException
After HADOOP-9343, it should be possible for YarnException to not be rooted at IOException
- YARN-628.
Major sub-task reported by Siddharth Seth and fixed by Siddharth Seth
Fix YarnException unwrapping
Unwrapping of YarnRemoteExceptions (currently in YarnRemoteExceptionPBImpl, RPCUtil post YARN-625) is broken, and often ends up throwing UndeclaredThrowableException. This needs to be fixed.
- YARN-625.
Major sub-task reported by Siddharth Seth and fixed by Siddharth Seth
Move unwrapAndThrowException from YarnRemoteExceptionPBImpl to RPCUtil
- YARN-618.
Major bug reported by Jian He and fixed by Jian He
Modify RM_INVALID_IDENTIFIER to a negative number
RM_INVALID_IDENTIFIER set to 0 doesn't sound right, as many tests set it to 0. Probably a negative number is what we want.
- YARN-617.
Minor sub-task reported by Vinod Kumar Vavilapalli and fixed by Omkar Vinit Joshi
In unsecure mode, AM can fake resource requirements
Without security, it is impossible to completely avoid AMs faking resources. We can at the least make it as difficult as possible by using the same container tokens and the RM-NM shared key mechanism over unauthenticated RM-NM channel.
At the minimum, this will avoid accidental bugs in AMs in unsecure mode.
- YARN-615.
Major sub-task reported by Vinod Kumar Vavilapalli and fixed by Vinod Kumar Vavilapalli
ContainerLaunchContext.containerTokens should simply be called tokens
ContainerToken is the name of the specific token that AMs use to launch containers on NMs, so we should rename CLC.containerTokens to be simply tokens.
- YARN-613.
Major sub-task reported by Bikas Saha and fixed by Omkar Vinit Joshi
Create NM proxy per NM instead of per container
Currently a new NM proxy has to be created per container, since the secure authentication uses a ContainerToken from the container.
- YARN-610.
Blocker sub-task reported by Siddharth Seth and fixed by Omkar Vinit Joshi
ClientToken (ClientToAMToken) should not be set in the environment
Similar to YARN-579, this can be set via ContainerTokens
- YARN-605.
Major bug reported by Hitesh Shah and fixed by Hitesh Shah
Failing unit test in TestNMWebServices when using git for source control
Failed tests: testNode(org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServices): hadoopBuildVersion doesn't match, got: 3.0.0-SNAPSHOT from fddcdcfb3cfe7dcc4f77c1ac953dd2cc0a890c62 (HEAD, origin/trunk, origin/HEAD, mrx-track) by Hitesh source checksum f89f5c9b9c9d44cf3be5c2686f2d789 expected: 3.0.0-SNAPSHOT from fddcdcfb3cfe7dcc4f77c1ac953dd2cc0a890c62 (HEAD, origin/trunk, origin/HEAD, mrx-track) by Hitesh source checksum f89f5c9b9c9d44cf3be5c2686f2d789
testNodeSlash(org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServices): hadoopBuildVersion doesn't match, got: 3.0.0-SNAPSHOT from fddcdcfb3cfe7dcc4f77c1ac953dd2cc0a890c62 (HEAD, origin/trunk, origin/HEAD, mrx-track) by Hitesh source checksum f89f5c9b9c9d44cf3be5c2686f2d789 expected: 3.0.0-SNAPSHOT from fddcdcfb3cfe7dcc4f77c1ac953dd2cc0a890c62 (HEAD, origin/trunk, origin/HEAD, mrx-track) by Hitesh source checksum f89f5c9b9c9d44cf3be5c2686f2d789
testNodeDefault(org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServices): hadoopBuildVersion doesn't match, got: 3.0.0-SNAPSHOT from fddcdcfb3cfe7dcc4f77c1ac953dd2cc0a890c62 (HEAD, origin/trunk, origin/HEAD, mrx-track) by Hitesh source checksum f89f5c9b9c9d44cf3be5c2686f2d789 expected: 3.0.0-SNAPSHOT from fddcdcfb3cfe7dcc4f77c1ac953dd2cc0a890c62 (HEAD, origin/trunk, origin/HEAD, mrx-track) by Hitesh source checksum f89f5c9b9c9d44cf3be5c2686f2d789
testNodeInfo(org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServices): hadoopBuildVersion doesn't match, got: 3.0.0-SNAPSHOT from fddcdcfb3cfe7dcc4f77c1ac953dd2cc0a890c62 (HEAD, origin/trunk, origin/HEAD, mrx-track) by Hitesh source checksum f89f5c9b9c9d44cf3be5c2686f2d789 expected: 3.0.0-SNAPSHOT from fddcdcfb3cfe7dcc4f77c1ac953dd2cc0a890c62 (HEAD, origin/trunk, origin/HEAD, mrx-track) by Hitesh source checksum f89f5c9b9c9d44cf3be5c2686f2d789
testNodeInfoSlash(org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServices): hadoopBuildVersion doesn't match, got: 3.0.0-SNAPSHOT from fddcdcfb3cfe7dcc4f77c1ac953dd2cc0a890c62 (HEAD, origin/trunk, origin/HEAD, mrx-track) by Hitesh source checksum f89f5c9b9c9d44cf3be5c2686f2d789 expected: 3.0.0-SNAPSHOT from fddcdcfb3cfe7dcc4f77c1ac953dd2cc0a890c62 (HEAD, origin/trunk, origin/HEAD, mrx-track) by Hitesh source checksum f89f5c9b9c9d44cf3be5c2686f2d789
testNodeInfoDefault(org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServices): hadoopBuildVersion doesn't match, got: 3.0.0-SNAPSHOT from fddcdcfb3cfe7dcc4f77c1ac953dd2cc0a890c62 (HEAD, origin/trunk, origin/HEAD, mrx-track) by Hitesh source checksum f89f5c9b9c9d44cf3be5c2686f2d789 expected: 3.0.0-SNAPSHOT from fddcdcfb3cfe7dcc4f77c1ac953dd2cc0a890c62 (HEAD, origin/trunk, origin/HEAD, mrx-track) by Hitesh source checksum f89f5c9b9c9d44cf3be5c2686f2d789
testSingleNodesXML(org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServices): hadoopBuildVersion doesn't match, got: 3.0.0-SNAPSHOT from fddcdcfb3cfe7dcc4f77c1ac953dd2cc0a890c62 (HEAD, origin/trunk, origin/HEAD, mrx-track) by Hitesh source checksum f89f5c9b9c9d44cf3be5c2686f2d789 expected: 3.0.0-SNAPSHOT from fddcdcfb3cfe7dcc4f77c1ac953dd2cc0a890c62 (HEAD, origin/trunk, origin/HEAD, mrx-track) by Hitesh source checksum f89f5c9b9c9d44cf3be5c2686f2d789
- YARN-600.
Major improvement reported by Sandy Ryza and fixed by Sandy Ryza (resourcemanager , scheduler)
Hook up cgroups CPU settings to the number of virtual cores allocated
YARN-3 introduced CPU isolation and monitoring through cgroups. YARN-2 introduced CPU scheduling in the capacity scheduler, and YARN-326 will introduce it in the fair scheduler. The number of virtual cores allocated to a container should be used to weight the number of cgroups CPU shares given to it.
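A minimal sketch of the weighting described above, assuming a baseline of 1024 shares per virtual core (the conventional cgroups cpu.shares default). The class and method names are hypothetical and are not taken from the patch.

```java
/** Hypothetical helper: maps a container's virtual cores to a cgroups cpu.shares value. */
final class CpuSharesSketch {
    // Assumption: 1024 is used as the weight of a single virtual core.
    static final int SHARES_PER_VCORE = 1024;

    /** Returns the cpu.shares weight for a container with the given number of virtual cores. */
    static int cpuShares(int virtualCores) {
        if (virtualCores < 1) {
            throw new IllegalArgumentException("virtualCores must be >= 1");
        }
        return SHARES_PER_VCORE * virtualCores;
    }

    public static void main(String[] args) {
        System.out.println(cpuShares(1));
        System.out.println(cpuShares(4));
    }
}
```

Because cpu.shares is a relative weight, a container with four virtual cores would receive roughly four times the CPU time of a one-core container when the node is contended, and no hard cap otherwise.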
- YARN-599.
Major bug reported by Zhijie Shen and fixed by Zhijie Shen
Refactoring submitApplication in ClientRMService and RMAppManager
Currently, ClientRMService#submitApplication calls RMAppManager#handle, and consequently calls RMAppManager#submitApplication directly, though the code looks like it is scheduling an APP_SUBMIT event.
In addition, the validation code before creating an RMApp instance is not well organized. Ideally, the dynamic validation, which depends on the RM's configuration, should be put in RMAppManager#submitApplication. RMAppManager#submitApplication is called by ClientRMService#submitApplication and RMAppManager#recover. Since the configuration may be changed after the RM restarts, the validation needs to be done again even in recovery mode. Therefore, resource request validation, which is based on min/max resource limits, should be moved from ClientRMService#submitApplication to RMAppManager#submitApplication. On the other hand, the static validation, which is independent of the RM's configuration, should be put in ClientRMService#submitApplication, because it only needs to be done once, during the first submission.
Furthermore, the try-catch flow in RMAppManager#submitApplication has a flaw: RMAppManager#submitApplication is not synchronized. If two application submissions with the same application ID enter the function, and one progresses to the completion of RMApp instantiation while the other progresses to the completion of putting the RMApp instance into rmContext, the slower submission will cause an exception due to the duplicate application ID. However, with the current code flow, the exception will cause the RMApp instance already in rmContext (which belongs to the faster submission) to be rejected.
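One way to avoid the duplicate-ID race described above is an atomic insert-if-absent, so that the slower submission is rejected without disturbing the entry registered by the faster one. The sketch below illustrates the idea only; the class, its map, and its method names are hypothetical and do not reproduce the RMAppManager code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical stand-in for the application registry kept in the RM context. */
final class SubmitSketch {
    private final ConcurrentMap<String, Object> apps = new ConcurrentHashMap<>();

    /**
     * Registers the application under its ID.
     * Returns true if this call registered it, false if the ID was already present.
     */
    boolean submit(String appId, Object app) {
        // putIfAbsent is atomic: a losing submission never replaces or removes the winner's entry.
        return apps.putIfAbsent(appId, app) == null;
    }

    /** Returns the registered application for the ID, or null if none exists. */
    Object get(String appId) {
        return apps.get(appId);
    }
}
```

The caller that receives false can report the duplicate ID to its client while the first application continues untouched.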
- YARN-598.
Major improvement reported by Sandy Ryza and fixed by Sandy Ryza (resourcemanager , scheduler)
Add virtual cores to queue metrics
QueueMetrics includes allocatedMB, availableMB, pendingMB, reservedMB. It should have equivalents for CPU.
- YARN-597.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic
TestFSDownload fails on Windows because of dependencies on tar/gzip/jar tools
{{testDownloadArchive}}, {{testDownloadPatternJar}} and {{testDownloadArchiveZip}} fail with a similar Shell ExitCodeException:
{code}
testDownloadArchiveZip(org.apache.hadoop.yarn.util.TestFSDownload) Time elapsed: 480 sec <<< ERROR!
org.apache.hadoop.util.Shell$ExitCodeException: bash: line 0: cd: /D:/svn/t/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/target/TestFSDownload: No such file or directory
gzip: 1: No such file or directory
at org.apache.hadoop.util.Shell.runCommand(Shell.java:377)
at org.apache.hadoop.util.Shell.run(Shell.java:292)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:497)
at org.apache.hadoop.yarn.util.TestFSDownload.createZipFile(TestFSDownload.java:225)
at org.apache.hadoop.yarn.util.TestFSDownload.testDownloadArchiveZip(TestFSDownload.java:503)
{code}
- YARN-595.
Major sub-task reported by Sandy Ryza and fixed by Sandy Ryza (scheduler)
Refactor fair scheduler to use common Resources
resourcemanager.fair and resourcemanager.resources have two copies of basically the same code for operations on Resource objects
- YARN-594.
Major bug reported by Jian He and fixed by Jian He
Update test and add comments in YARN-534
This jira is simply to add some comments in the patch YARN-534 and update the test case
- YARN-593.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (nodemanager)
container launch on Windows does not correctly populate classpath with new process's environment variables and localized resources
On Windows, we must bundle the classpath of a launched container in an intermediate jar with a manifest. Currently, this logic incorrectly uses the nodemanager process's environment variables for substitution. Instead, it needs to use the new environment for the launched process. Also, the bundled classpath is missing some localized resources for directories, due to a quirk in the way {{File#toURI}} decides whether or not to append a trailing '/'.
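The {{File#toURI}} quirk mentioned above is that a trailing '/' is appended only when the path can be determined to be a directory at call time, so a directory that does not exist yet yields a URI without the slash. A minimal sketch of a workaround, assuming the caller already knows the entry is a directory (the helper name is hypothetical):

```java
import java.io.File;

/** Hypothetical helper for building classpath entries for a manifest. */
final class ClasspathUriSketch {
    /** Returns the URI string for a classpath entry, forcing a trailing '/' for directories. */
    static String toClasspathEntry(File file, boolean isDirectory) {
        String uri = file.toURI().toString();
        // File#toURI only adds the slash if the directory exists when it is called.
        if (isDirectory && !uri.endsWith("/")) {
            uri += "/";
        }
        return uri;
    }
}
```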
- YARN-591.
Major sub-task reported by Vinod Kumar Vavilapalli and fixed by Vinod Kumar Vavilapalli
RM recovery related records do not belong to the API
We need to move ApplicationStateData and ApplicationAttemptStateData out into the resourcemanager module. They are not part of the public API.
- YARN-590.
Major improvement reported by Vinod Kumar Vavilapalli and fixed by Mayank Bansal
Add an optional message to RegisterNodeManagerResponse as to why NM is being asked to resync or shutdown
We should log such message in NM itself. Helps in debugging issues on NM directly instead of distributed debugging between RM and NM when such an action is received from RM.
- YARN-586.
Trivial bug reported by Zhijie Shen and fixed by Zhijie Shen
Typo in ApplicationSubmissionContext#setApplicationId
The parameter should be applicationId instead of appplicationId
- YARN-585.
Major bug reported by Zhijie Shen and fixed by Zhijie Shen
TestFairScheduler#testNotAllowSubmitApplication is broken due to YARN-514
TestFairScheduler#testNotAllowSubmitApplication is broken due to YARN-514. See the discussions in YARN-514.
- YARN-583.
Major sub-task reported by Omkar Vinit Joshi and fixed by Omkar Vinit Joshi
Application cache files should be localized under local-dir/usercache/userid/appcache/appid/filecache
Currently application cache files are getting localized under local-dir/usercache/userid/appcache/appid/. However, they should be localized under the filecache subdirectory.
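The target layout can be sketched as a simple path computation. The class and method names here are hypothetical; only the directory names come from the description above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper that builds the per-application file cache directory. */
final class AppCachePathSketch {
    /** Returns local-dir/usercache/user/appcache/appId/filecache. */
    static Path appFileCacheDir(String localDir, String user, String appId) {
        return Paths.get(localDir, "usercache", user, "appcache", appId, "filecache");
    }
}
```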
- YARN-582.
Major sub-task reported by Bikas Saha and fixed by Jian He (resourcemanager)
Restore appToken and clientToken for app attempt after RM restart
These need to be saved and restored on a per app attempt basis. This is required only when work preserving restart is implemented for secure clusters. In non-preserving restart app attempts are killed and so this does not matter.
- YARN-581.
Major sub-task reported by Bikas Saha and fixed by Jian He (resourcemanager)
Test and verify that app delegation tokens are added to tokenRenewer after RM restart
The code already saves the delegation tokens in AppSubmissionContext. Upon restart the AppSubmissionContext is used to submit the application again and so restores the delegation tokens. This jira tracks testing and verifying this functionality in a secure setup.
- YARN-579.
Major sub-task reported by Vinod Kumar Vavilapalli and fixed by Vinod Kumar Vavilapalli
Make ApplicationToken part of Container's token list to help RM-restart
Container is already persisted for helping RM restart. Instead of explicitly setting ApplicationToken in AM's env, if we change it to be in Container, we can avoid env and can also help restart.
- YARN-578.
Major sub-task reported by Vinod Kumar Vavilapalli and fixed by Omkar Vinit Joshi (nodemanager)
NodeManager should use SecureIOUtils for serving and aggregating logs
Log servlets for serving logs and the ShuffleService for serving intermediate outputs both should use SecureIOUtils for avoiding symlink attacks.
- YARN-577.
Major sub-task reported by Hitesh Shah and fixed by Hitesh Shah
ApplicationReport does not provide progress value of application
An application sends its progress % to the RM via AllocateRequest. This should be able to be retrieved by a client via the ApplicationReport.
- YARN-576.
Major bug reported by Hitesh Shah and fixed by Kenji Kikushima
RM should not allow registrations from NMs that do not satisfy minimum scheduler allocations
If the minimum resource allocation configured for the RM scheduler is 1 GB, the RM should drop all NMs that register with a total capacity of less than 1 GB.
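As a rough illustration of the check proposed above, memory only and with hypothetical names:

```java
/** Hypothetical registration check: a node must be able to host at least one minimum allocation. */
final class NodeRegistrationSketch {
    /** Returns true if a node with the given total memory may register. */
    static boolean acceptNode(int nodeMemoryMb, int minAllocationMb) {
        return nodeMemoryMb >= minAllocationMb;
    }
}
```

With a 1 GB (1024 MB) minimum allocation, a node reporting 512 MB would be turned away, while a node reporting exactly 1024 MB would be accepted.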
- YARN-571.
Major sub-task reported by Hitesh Shah and fixed by Omkar Vinit Joshi
User should not be part of ContainerLaunchContext
Today, a user is expected to set the user name in the CLC when either submitting an application or launching a container from the AM. This does not make sense as the user can/has been identified by the RM as part of the RPC layer.
The solution would be to move the user information into either the Container object or directly into the ContainerToken, which can then be used by the NM to launch the container. This user information would be set into the container by the RM.
- YARN-569.
Major sub-task reported by Carlo Curino and fixed by Carlo Curino (capacityscheduler)
CapacityScheduler: support for preemption (using a capacity monitor)
There is a tension between the fast-paced reactive role of the CapacityScheduler, which needs to respond quickly to
application resource requests and node updates, and the more introspective, time-based considerations
needed to observe and correct for capacity balance. For this purpose, instead of hacking the delicate
mechanisms of the CapacityScheduler directly, we opted to add support for preemption by means of a "Capacity Monitor",
which can be run optionally as a separate service (much like the NMLivelinessMonitor).
The capacity monitor (similarly to equivalent functionality in the fair scheduler) runs at intervals
(e.g., every 3 seconds), observes the state of the assignment of resources to queues from the capacity scheduler,
performs off-line computation to determine whether preemption is needed and how best to "edit" the current schedule to
improve capacity, and generates events that produce four possible actions:
# Container de-reservations
# Resource-based preemptions
# Container-based preemptions
# Container killing
The actions listed above are progressively more costly, and it is up to the policy to use them as desired to achieve the rebalancing goals.
Note that due to the "lag" in the effect of these actions the policy should operate at the macroscopic level (e.g., preempt tens of containers
from a queue) and should not try to tightly and consistently micromanage container allocations.
------------- Preemption policy (ProportionalCapacityPreemptionPolicy): -------------
Preemption policies are by design pluggable; in the following we present an initial policy (ProportionalCapacityPreemptionPolicy) we have been experimenting with. The ProportionalCapacityPreemptionPolicy behaves as follows:
# it gathers from the scheduler the state of the queues, in particular, their current capacity, guaranteed capacity and pending requests (*)
# if there are pending requests from queues that are under capacity it computes a new ideal balanced state (**)
# it computes the set of preemptions needed to repair the current schedule and achieve capacity balance (accounting for natural completion rates, and
respecting bounds on the amount of preemption we allow for each round)
# it selects which applications to preempt from each over-capacity queue (the last one in the FIFO order)
# it removes reservations from the most recently assigned app until the amount of resource to reclaim is obtained, or until no more reservations exist
# (if not enough) it issues preemptions for containers from the same applications (reverse chronological order, last assigned container first), again until the required amount is reclaimed or until no containers except the AM container are left,
# (if not enough) it moves on to unreserve and preempt from the next application.
# containers that have been asked to preempt are tracked across executions. If a container is among the ones to be preempted for more than a certain time, the container is moved to the list of containers to be forcibly killed.
Notes:
(*) at the moment, in order to avoid double-counting of the requests, we only look at the "ANY" part of pending resource requests, which means we might not preempt on behalf of AMs that ask only for specific locations but not any.
(**) The ideal balance state is one in which each queue has at least its guaranteed capacity, and the spare capacity is distributed among the queues that want some as a weighted fair share, where the weighting is based on the guaranteed capacity of a queue and the function runs to a fixed point.
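The fixed-point computation in note (**) could be sketched as follows. This is an illustration under stated assumptions, not the policy's actual code: the class, method, and parameter names are hypothetical, resources are modelled as plain doubles, and a queue's demand is assumed to be its current usage plus its pending requests.

```java
/** Hypothetical sketch of the "ideal balanced state": weighted fair share run to a fixed point. */
final class IdealBalanceSketch {
    /**
     * Distributes total capacity among queues in proportion to their guaranteed capacity,
     * never giving a queue more than it demands, and re-offering leftovers until nothing changes.
     */
    static double[] idealAssignment(double total, double[] guaranteed, double[] demand) {
        final double eps = 1e-9;
        int n = guaranteed.length;
        double[] ideal = new double[n];
        boolean[] active = new boolean[n];
        for (int i = 0; i < n; i++) {
            active[i] = demand[i] > eps;
        }
        double unassigned = total;
        while (unassigned > eps) {
            // Weight of the queues that still want more capacity.
            double weight = 0;
            for (int i = 0; i < n; i++) {
                if (active[i]) weight += guaranteed[i];
            }
            if (weight <= 0) break;
            double handedOut = 0;
            for (int i = 0; i < n; i++) {
                if (!active[i]) continue;
                double offer = unassigned * guaranteed[i] / weight;
                double accepted = Math.min(offer, demand[i] - ideal[i]);
                ideal[i] += accepted;
                handedOut += accepted;
                // A satisfied queue drops out; its unused offer is redistributed next round.
                if (demand[i] - ideal[i] <= eps) active[i] = false;
            }
            unassigned -= handedOut;
            if (handedOut <= eps) break; // fixed point reached
        }
        return ideal;
    }
}
```

For example, with a total of 100, guaranteed capacities {50, 30, 20} and demands {20, 60, 60}, the first queue keeps only the 20 it needs and the spare 30 is split 18/12 between the other two, giving {20, 48, 32}. Queues whose current usage exceeds their ideal value would then be the candidates for preemption.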
Tunables of the ProportionalCapacityPreemptionPolicy:
# observe-only mode (i.e., log the actions it would take, but behave as read-only)
# how frequently to run the policy
# how long to wait between preemption and kill of a container
# which fraction of the containers I would like to obtain should I preempt (has to do with the natural rate at which containers are returned)
# deadzone size, i.e., what % of over-capacity should I ignore (if we are off perfect balance by some small % we ignore it)
# overall amount of preemption we can afford for each run of the policy (in terms of total cluster capacity)
In our current experiments this set of tunables seems to be a good start to shape the preemption action properly. More sophisticated preemption policies could take into account the different types of applications running, job priorities, cost of preemption, and the integral of capacity imbalance. This is very much a control-theory kind of problem, and some of the lessons on designing and tuning controllers are likely to apply.
Generality:
The monitor-based scheduler edit and the preemption mechanisms we introduced here are designed to be more general than enforcing capacity/fairness; in fact, we are considering other monitors that leverage the same idea of "schedule edits" to target different global properties (e.g., allocating enough resources to guarantee deadlines for important jobs, data-locality optimizations, IO-balancing among nodes, etc.).
Note that by default the preemption policy we describe is disabled in the patch.
Depends on YARN-45 and YARN-567, is related to YARN-568
- YARN-568.
Major improvement reported by Carlo Curino and fixed by Carlo Curino (scheduler)
FairScheduler: support for work-preserving preemption
In the attached patch, we modified the FairScheduler to substitute its preemption-by-killing with a work-preserving version of preemption (followed by killing if the AMs do not respond quickly enough). This should allow preemption checking to run more often, but killing to happen less often (proper tuning to be investigated). Depends on YARN-567 and YARN-45, is related to YARN-569.
- YARN-567.
Major sub-task reported by Carlo Curino and fixed by Carlo Curino (resourcemanager)
RM changes to support preemption for FairScheduler and CapacityScheduler
A common tradeoff in scheduling jobs is between keeping the cluster busy and enforcing capacity/fairness properties. FairScheduler and CapacityScheduler take opposite stances on how to achieve this.
The FairScheduler leverages task-killing to quickly reclaim resources from currently running jobs and redistribute them among new jobs, thus keeping the cluster busy but wasting useful work. The CapacityScheduler is typically tuned
to limit the portion of the cluster used by each queue so that the likelihood of violating capacity is low, thus never wasting work, but risking keeping the cluster underutilized or having jobs wait to obtain their rightful capacity.
By introducing the notion of work-preserving preemption we can remove this tradeoff. This requires a protocol for preemption (YARN-45), and ApplicationMasters that can respond to preemption efficiently (e.g., by saving their intermediate state; this will be posted for MapReduce in a separate JIRA soon), together with a scheduler that can issue preemption requests (discussed in separate JIRAs YARN-568 and YARN-569).
The changes we track with this JIRA are common to FairScheduler and CapacityScheduler, and are mostly propagation of preemption decisions through the ApplicationMastersService.
- YARN-563.
Major sub-task reported by Thomas Weise and fixed by Mayank Bansal
Add application type to ApplicationReport
This field is needed to distinguish different types of applications (app master implementations). For example, we may run applications of type XYZ in a cluster alongside MR and would like to filter applications by type.
- YARN-562.
Major sub-task reported by Jian He and fixed by Jian He (resourcemanager)
NM should reject containers allocated by previous RM
It's possible that after RM shutdown, before the AM goes down, the AM still calls startContainer on the NM with containers allocated by the previous RM. When the RM comes back, the NM doesn't know whether this container launch request comes from the previous RM or the current RM. We should reject containers allocated by the previous RM.
- YARN-561.
Major sub-task reported by Hitesh Shah and fixed by Xuan Gong
Nodemanager should set some key information into the environment of every container that it launches.
Information such as containerId, nodemanager hostname, nodemanager port is not set in the environment when any container is launched.
For an AM, the RM does all of this for it but for a container launched by an application, all of the above need to be set by the ApplicationMaster.
At the minimum, container id would be a useful piece of information. If the container wishes to talk to its local NM, the nodemanager related information would also come in handy.
- YARN-557.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (applications)
TestUnmanagedAMLauncher fails on Windows
{{TestUnmanagedAMLauncher}} fails on Windows due to attempting to run a Unix-specific command in distributed shell and use of a Unix-specific environment variable to determine username for the {{ContainerLaunchContext}}.
- YARN-553.
Minor sub-task reported by Harsh J and fixed by Karthik Kambatla (client)
Have YarnClient generate a directly usable ApplicationSubmissionContext
Right now, we're doing multiple steps to create a relevant ApplicationSubmissionContext for a pre-received GetNewApplicationResponse.
{code}
GetNewApplicationResponse newApp = yarnClient.getNewApplication();
ApplicationId appId = newApp.getApplicationId();
ApplicationSubmissionContext appContext = Records.newRecord(ApplicationSubmissionContext.class);
appContext.setApplicationId(appId);
{code}
A simplified way may be to have the GetNewApplicationResponse itself provide a helper method that builds a usable ApplicationSubmissionContext for us. Something like:
{code}
GetNewApplicationResponse newApp = yarnClient.getNewApplication();
ApplicationSubmissionContext appContext = newApp.generateApplicationSubmissionContext();
{code}
[The above method can also take an arg for the container launch spec, or perhaps pre-load defaults like min-resource, etc. in the returned object, aside of just associating the application ID automatically.]
- YARN-549.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
YarnClient.submitApplication should wait for application to be accepted by the RM
Currently, when submitting an application, storeApplication will be called for recovery. However, it is a blocking API, and is likely to block concurrent application submissions. Therefore, it is good to make application submission asynchronous, and postpone storeApplication. YarnClient needs to change to wait for the whole operation to complete so that clients can be notified after the application is really submitted. YarnClient needs to wait for application to reach SUBMITTED state or beyond.
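The client-side wait described above amounts to polling the application's state until it has moved past its initial state. A minimal sketch, assuming a simplified state enum and a caller-supplied way of fetching the current state; all names here are hypothetical and the real YarnClient code may differ.

```java
import java.util.function.Supplier;

/** Hypothetical sketch of a client waiting for an application to be submitted. */
final class SubmitWaitSketch {
    // Assumption: a simplified set of states; NEW is the only "not yet submitted" state here.
    enum AppState { NEW, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

    /** Polls the state until it is SUBMITTED or beyond, or gives up after maxPolls checks. */
    static AppState waitUntilSubmitted(Supplier<AppState> currentState, long pollMillis, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            AppState state = currentState.get();
            if (state != AppState.NEW) {
                return state;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                throw new IllegalStateException("interrupted while waiting for submission", e);
            }
        }
        throw new IllegalStateException("application was not submitted after " + maxPolls + " polls");
    }
}
```

Bounding the number of polls is a design choice of this sketch; it keeps a client from waiting forever if the RM never stores the application.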
- YARN-548.
Major sub-task reported by Vadim Bondarev and fixed by Vadim Bondarev
Add tests for YarnUncaughtExceptionHandler
- YARN-547.
Major sub-task reported by Omkar Vinit Joshi and fixed by Omkar Vinit Joshi
Race condition in Public / Private Localizer may result into resource getting downloaded again
Public Localizer :
At present when multiple containers try to request a localized resource
* If the resource is not present then first it is created and Resource Localization starts ( LocalizedResource is in DOWNLOADING state)
* Now if in this state multiple ResourceRequestEvents arrive then ResourceLocalizationEvents are sent for all of them.
Most of the time this does not result in a duplicate resource download, but there is a race condition. Inside ResourceLocalization (for public download) all the requests are added to a local attempts map. If a new request comes in, it is first checked against this map before a new download starts for it. For the current download, the request will be in the map. If a request for the same resource comes in, it will be rejected (i.e., the resource is already being downloaded). However, once the current download completes, the request is removed from this local map. If the LocalizerRequestEvent comes in after this removal, the resource will be downloaded again, as it is no longer present in the local map.
PrivateLocalizer :
Here a different but similar race condition is present.
* Here inside findNextResource method call; each LocalizerRunner tries to grab a lock on LocalizerResource. If the lock is not acquired then it will keep trying until the resource state changes to LOCALIZED. This lock will be released by the LocalizerRunner when download completes.
* Now if another ContainerLocalizer tries to grab the lock on a resource before LocalizedResource state changes to LOCALIZED then resource will be downloaded again.
In both places, the root cause is that all the threads try to acquire the lock on the resource, but the current state of the LocalizedResource is not taken into consideration.
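The root cause suggests a fix along these lines: check the resource's state while holding the lock, and start a download only if none has been started or completed. The sketch below is illustrative; the class is a hypothetical stand-in for LocalizedResource, not the NodeManager implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a localized resource guarded by its own lock. */
final class LocalizedResourceSketch {
    enum State { INIT, DOWNLOADING, LOCALIZED }

    private State state = State.INIT;
    private final AtomicInteger downloads = new AtomicInteger();

    /** Returns true only for the single caller that should perform the download. */
    synchronized boolean tryStartDownload() {
        // The state is consulted under the lock, so a late request cannot restart a download.
        if (state != State.INIT) {
            return false;
        }
        state = State.DOWNLOADING;
        downloads.incrementAndGet();
        return true;
    }

    /** Marks the download as complete. */
    synchronized void markLocalized() {
        state = State.LOCALIZED;
    }

    /** Number of downloads that were started for this resource. */
    int downloadCount() {
        return downloads.get();
    }
}
```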
- YARN-542.
Major bug reported by Vinod Kumar Vavilapalli and fixed by Zhijie Shen
Change the default global AM max-attempts value to be not one
Today, the global AM max-attempts is set to 1, which is a bad choice. AM max-attempts accounts for both AM-level failures and container crashes due to localization issues, lost nodes, etc. To account for AM crashes due to problems that are not caused by user code, mainly lost nodes, we want to give AMs some retries.
I propose we change it to at least two. We can change it to 4 to match other retry configs.
- YARN-541.
Blocker bug reported by Krishna Kishore Bonagiri and fixed by Bikas Saha (resourcemanager)
getAllocatedContainers() is not returning all the allocated containers
I am running an application that was written and working well with the hadoop-2.0.0-alpha but when I am running the same against 2.0.3-alpha, the getAllocatedContainers() method called on AMResponse is not returning all the containers allocated sometimes. For example, I request for 10 containers and this method gives me only 9 containers sometimes, and when I looked at the log of Resource Manager, the 10th container is also allocated. It happens only sometimes randomly and works fine all other times. If I send one more request for the remaining container to RM after it failed to give them the first time(and before releasing already acquired ones), it could allocate that container. I am running only one application at a time, but 1000s of them one after another.
My main worry is, even though the RM's log is saying that all 10 requested containers are allocated, the getAllocatedContainers() method is not returning me all of them, it returned only 9 surprisingly. I never saw this kind of issue in the previous version, i.e. hadoop-2.0.0-alpha.
Thanks,
Kishore
- YARN-539.
Major sub-task reported by Omkar Vinit Joshi and fixed by Omkar Vinit Joshi
LocalizedResources are leaked in memory in case resource localization fails
If resource localization fails then resource remains in memory and is
1) Either cleaned up when next time cache cleanup runs and there is space crunch. (If sufficient space in cache is available then it will remain in memory).
2) reused if LocalizationRequest comes again for the same resource.
I think when resource localization fails then that event should be sent to LocalResourceTracker which will then remove it from its cache.
- YARN-538.
Major improvement reported by Sandy Ryza and fixed by Sandy Ryza
RM address DNS lookup can cause unnecessary slowness on every JHS page load
When I run the job history server locally, every page load takes in the 10s of seconds. I profiled the process and discovered that all the extra time was spent inside YarnConfiguration#getRMWebAppURL, trying to resolve 0.0.0.0 to a hostname. When I changed my yarn.resourcemanager.address to localhost, the page load times decreased drastically.
There's no reason to perform this resolution on every page load.
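One way to avoid the repeated lookup is to resolve each address once and reuse the result. A minimal sketch with hypothetical names, where the resolver function stands in for whatever performs the DNS lookup:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical cache in front of an expensive address-to-hostname resolution. */
final class CachedResolverSketch {
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> resolver;

    CachedResolverSketch(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    /** Resolves the address on first use and returns the cached result afterwards. */
    String resolve(String address) {
        return cache.computeIfAbsent(address, resolver);
    }
}
```

This sketch never expires entries, which is only reasonable if the configured address does not change while the process is running.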
- YARN-536.
Major sub-task reported by Xuan Gong and fixed by Xuan Gong
Remove ContainerStatus, ContainerState from Container api interface as they will not be called by the container object
Remove containerstate, containerStatus from container interface. They will not be called by container object
- YARN-534.
Major sub-task reported by Jian He and fixed by Jian He (resourcemanager)
AM max attempts is not checked when RM restart and try to recover attempts
Currently, AM max attempts is only checked when the current attempt fails, to decide whether to create a new attempt. If the RM restarts before the max attempt fails, it will not clean the state store; when the RM comes back, it will retry the attempt again.
- YARN-532.
Major bug reported by Siddharth Seth and fixed by Siddharth Seth
RMAdminProtocolPBClientImpl should implement Closeable
Required for RPC.stopProxy to work. Already done in most of the other protocols. (MAPREDUCE-5117 addressing the one other protocol missing this)
- YARN-530.
Major sub-task reported by Steve Loughran and fixed by Steve Loughran
Define Service model strictly, implement AbstractService for robust subclassing, migrate yarn-common services
# Extend the YARN {{Service}} interface as discussed in YARN-117
# Implement the changes in {{AbstractService}} and {{FilterService}}.
# Migrate all services in yarn-common to the more robust service model, test.
- YARN-525.
Major improvement reported by Thomas Graves and fixed by Thomas Graves (capacityscheduler)
make CS node-locality-delay refreshable
the config yarn.scheduler.capacity.node-locality-delay doesn't change when you change the value in capacity_scheduler.xml and then run yarn rmadmin -refreshQueues.
- YARN-523.
Major sub-task reported by Vinod Kumar Vavilapalli and fixed by Jian He
Container localization failures aren't reported from NM to RM
This is mainly a pain on crashing AMs, but once we fix this, containers also can benefit - same fix for both.
- YARN-521.
Major sub-task reported by Sandy Ryza and fixed by Sandy Ryza (api)
Augment AM - RM client module to be able to request containers only at specific locations
When YARN-392 and YARN-398 are completed, it would be good for AMRMClient to offer an easy way to access their functionality
- YARN-518.
Major improvement reported by Dapeng Sun and fixed by Sandy Ryza (documentation)
Fair Scheduler's document link could be added to the hadoop 2.x main doc page
Currently the doc page for Fair Scheduler looks good and it’s here, http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.
It would be better to add the document link to the YARN section in the Hadoop 2.x main doc page, so that users can easily find the doc to experimentally try Fair Scheduler as Capacity Scheduler.
- YARN-515.
Blocker bug reported by Robert Joseph Evans and fixed by Robert Joseph Evans
Node Manager not getting the master key
On branch-2 the latest version I see the following on a secure cluster.
{noformat}
2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Security enabled - updating secret keys now
2013-03-28 19:21:06,243 [main] INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as RM:PORT with total resource of <memory:12288, vCores:16>
2013-03-28 19:21:06,244 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl is started.
2013-03-28 19:21:06,245 [main] INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.NodeManager is started.
2013-03-28 19:21:07,257 [Node Status Updater] ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Caught exception in status-updater
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.security.BaseContainerTokenSecretManager.getCurrentKey(BaseContainerTokenSecretManager.java:121)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:407)
{noformat}
The Null pointer exception just keeps repeating and all of the nodes end up being lost. It looks like it never gets the secret key when it registers.
- YARN-514.
Major sub-task reported by Bikas Saha and fixed by Zhijie Shen (resourcemanager)
Delayed store operations should not result in RM unavailability for app submission
Currently, app submission is the only store operation performed synchronously because the app must be stored before the request returns with success. This makes the RM susceptible to blocking all client threads on slow store operations, resulting in RM being perceived as unavailable by clients.
- YARN-513.
Major sub-task reported by Bikas Saha and fixed by Jian He (resourcemanager)
Create common proxy client for communicating with RM
When the RM is restarting, the NM, AM and Clients should wait for some time for the RM to come back up.
- YARN-512.
Minor bug reported by Jason Lowe and fixed by Maysam Yabandeh (nodemanager)
Log aggregation root directory check is more expensive than it needs to be
The log aggregation root directory check first does an {{exists}} call followed by a {{getFileStatus}} call. That effectively stats the file twice. It should just use {{getFileStatus}} and catch {{FileNotFoundException}} to handle the non-existent case.
In addition we may consider caching the presence of the directory rather than checking it each time a node aggregates logs for an application.
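The single-stat idea described above can be sketched as follows. This is a minimal illustration that uses the JDK's java.nio.file API as a stand-in for Hadoop's FileSystem calls; the class and method names (RootDirCheck, isExistingDirectory) are hypothetical and not taken from the actual patch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical stand-in: java.nio.file is used here instead of Hadoop's FileSystem API.
class RootDirCheck {
    /** Returns true if the path exists and is a directory, using one attribute lookup. */
    static boolean isExistingDirectory(Path path) {
        try {
            // One lookup answers both "does it exist?" and "is it a directory?".
            BasicFileAttributes attrs = Files.readAttributes(path, BasicFileAttributes.class);
            return attrs.isDirectory();
        } catch (NoSuchFileException e) {
            // The non-existent case is handled here instead of by a separate exists() call.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path tmp = Paths.get(System.getProperty("java.io.tmpdir"));
        System.out.println(isExistingDirectory(tmp));
        System.out.println(isExistingDirectory(tmp.resolve("no-such-dir-" + System.nanoTime())));
    }
}
```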
- YARN-507.
Minor bug reported by Karthik Kambatla and fixed by Karthik Kambatla (scheduler)
Add interface visibility and stability annotations to FS interfaces/classes
Many of the FS classes/interfaces are missing annotations on visibility and stability.
- YARN-506.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic
Move to common utils FileUtil#setReadable/Writable/Executable and FileUtil#canRead/Write/Execute
Move to common utils described in HADOOP-9413 that work well cross-platform.
- YARN-500.
Major bug reported by Nishan Shetty and fixed by Kenji Kikushima (resourcemanager)
ResourceManager webapp is using next port if configured port is already in use
- YARN-496.
Minor bug reported by Sandy Ryza and fixed by Sandy Ryza (scheduler)
Fair scheduler configs are refreshed inconsistently in reinitialize
When FairScheduler#reinitialize is called, some of the scheduler-wide configs are refreshed and others aren't. They should all be refreshed.
Ones that are refreshed: userAsDefaultQueue, nodeLocalityThreshold, rackLocalityThreshold, preemptionEnabled
Ones that aren't: minimumAllocation, maximumAllocation, assignMultiple, maxAssign
- YARN-495.
Major bug reported by Jian He and fixed by Jian He
Change NM behavior of reboot to resync
When a reboot command is sent from RM, the node manager doesn't clean up the containers while it's stopping.
- YARN-493.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (nodemanager)
NodeManager job control logic flaws on Windows
Both product and test code contain some platform-specific assumptions, such as availability of bash for executing a command in a container and signals to check existence of a process and terminate it.
- YARN-491.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (nodemanager)
TestContainerLogsPage fails on Windows
{{TestContainerLogsPage}} contains some code for initializing a log directory that doesn't work correctly on Windows.
- YARN-490.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (applications/distributed-shell)
TestDistributedShell fails on Windows
There are a few platform-specific assumptions in distributed shell (both main code and test code) that prevent it from working correctly on Windows.
- YARN-488.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (nodemanager)
TestContainerManagerSecurity fails on Windows
These tests are failing to launch containers correctly when running on Windows.
- YARN-487.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (nodemanager)
TestDiskFailures fails on Windows due to path mishandling
{{TestDiskFailures#testDirFailuresOnStartup}} fails due to insertion of an extra leading '/' on the path within {{LocalDirsHandlerService}} when running on Windows. The test assertions also fail to account for the fact that {{Path}} normalizes '\' to '/'.
- YARN-486.
Major sub-task reported by Bikas Saha and fixed by Xuan Gong
Change startContainer NM API to accept Container as a parameter and make ContainerLaunchContext user land
Currently, id, resource request etc. need to be copied over from Container to ContainerLaunchContext. This can be brittle. It also leads to duplication of information (such as Resource from CLC and Resource from Container and Container.tokens). Sending Container directly to startContainer solves these problems. It also makes CLC clean by only having stuff in it that is set by the client/AM.
- YARN-485.
Major bug reported by Karthik Kambatla and fixed by Karthik Kambatla
TestProcfsProcessTree#testProcessTree() doesn't wait long enough for the process to die
TestProcfsProcessTree#testProcessTree fails occasionally with the following stack trace
{noformat}
Stack Trace:
junit.framework.AssertionFailedError: expected:<false> but was:<true>
at org.apache.hadoop.util.TestProcfsBasedProcessTree.testProcessTree(TestProcfsBasedProcessTree.java)
{noformat}
kill -9 is executed asynchronously: the signal is delivered when the process comes out of the kernel (sys call). Checking whether the process has died immediately afterwards can therefore fail at times.
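One common way to make such a check robust is to poll with a deadline instead of checking once. The sketch below is a generic helper under that assumption; PollUntil and waitFor are hypothetical names, and this is not the code used in the actual test fix.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper: polls a condition until it holds or a timeout expires.
class PollUntil {
    /** Returns true once the condition holds, or false if the timeout elapses first. */
    static boolean waitFor(BooleanSupplier condition, long timeoutMs, long intervalMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!condition.getAsBoolean()) {
            // Compare via subtraction so the check stays correct if nanoTime wraps.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt status and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in for "the process is gone": becomes true after roughly 50 ms.
        boolean done = waitFor(() -> System.currentTimeMillis() - start >= 50, 2000, 10);
        System.out.println(done);
    }
}
```

A test would pass a condition such as "the process is no longer alive" and assert on the returned value, so a slow signal delivery only costs a few extra polls.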
- YARN-482.
Major sub-task reported by Karthik Kambatla and fixed by Karthik Kambatla (scheduler)
FS: Extend SchedulingMode to intermediate queues
FS allows setting {{SchedulingMode}} for leaf queues. Extending this to non-leaf queues allows using different kinds of fairness: e.g., root can have three child queues - fair-mem, drf-cpu-mem, drf-cpu-disk-mem taking different number of resources into account. In turn, this allows users to decide on the scheduling latency vs sophistication of the scheduling mode.
- YARN-481.
Major bug reported by Chris Riccomini and fixed by Chris Riccomini (client)
Add AM Host and RPC Port to ApplicationCLI Status Output
Hey Guys,
I noticed that the ApplicationCLI is just randomly not printing some of the values in the ApplicationReport. I've added the getHost and getRpcPort. These are useful for me, since I want to make an RPC call to the AM (not the tracker call).
Thanks!
Chris
- YARN-479.
Major bug reported by Hitesh Shah and fixed by Jian He
NM retry behavior for connection to RM should be similar for lost heartbeats
Regardless of connection loss at the start or at an intermediate point, NM's retry behavior to the RM should follow the same flow.
- YARN-476.
Minor bug reported by Jason Lowe and fixed by Sandy Ryza
ProcfsBasedProcessTree info message confuses users
ProcfsBasedProcessTree has a habit of emitting not-so-helpful messages such as the following:
{noformat}
2013-03-13 12:41:51,957 INFO [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: The process 28747 may have finished in the interim.
2013-03-13 12:41:51,958 INFO [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: The process 28978 may have finished in the interim.
2013-03-13 12:41:51,958 INFO [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: The process 28979 may have finished in the interim.
{noformat}
As described in MAPREDUCE-4570, this is something that naturally occurs in the process of monitoring processes via procfs. It's uninteresting at best and can confuse users who think it's a reason their job isn't running as expected when it appears in their logs.
We should either make this DEBUG or remove it entirely.
- YARN-475.
Major sub-task reported by Hitesh Shah and fixed by Hitesh Shah
Remove ApplicationConstants.AM_APP_ATTEMPT_ID_ENV as it is no longer set in an AM's environment
AMs are expected to use ApplicationConstants.AM_CONTAINER_ID_ENV and derive the application attempt id from the container id.
- YARN-474.
Major bug reported by Hitesh Shah and fixed by Zhijie Shen (capacityscheduler)
CapacityScheduler does not activate applications when maximum-am-resource-percent configuration is refreshed
Submit 3 applications to a cluster where capacity scheduler limits allow only 1 running application. Modify capacity scheduler config to increase value of yarn.scheduler.capacity.maximum-am-resource-percent and invoke refresh queues.
The 2 applications not yet in running state do not get launched even though limits are increased.
- YARN-469.
Major sub-task reported by Karthik Kambatla and fixed by Karthik Kambatla (scheduler)
Make scheduling mode in FS pluggable
Currently, scheduling mode in FS is limited to Fair and FIFO. The code typically has an if condition at multiple places to determine the correct course of action.
Making the scheduling mode pluggable helps in simplifying this process, particularly as we add new modes (DRF in this case).
- YARN-468.
Major sub-task reported by Aleksey Gorshkov and fixed by Aleksey Gorshkov
coverage fix for org.apache.hadoop.yarn.server.webproxy.amfilter
coverage fix org.apache.hadoop.yarn.server.webproxy.amfilter
patch YARN-468-trunk.patch for trunk, branch-2, branch-0.23
- YARN-467.
Major sub-task reported by Omkar Vinit Joshi and fixed by Omkar Vinit Joshi (nodemanager)
Jobs fail during resource localization when public distributed-cache hits unix directory limits
If multiple jobs use the distributed cache with many small files, the directory limit is reached before the cache size limit, and the NM fails to create any more directories in the file cache (PUBLIC). The jobs start failing with the exception below.
java.io.IOException: mkdir of /tmp/nm-local-dir/filecache/3901886847734194975 failed
at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:909)
at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143)
at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706)
at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703)
at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2325)
at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
We need a mechanism that creates a directory hierarchy and limits the number of files per directory.
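As an illustration of the kind of mechanism requested, the sketch below maps a sequential resource id to a relative path so that no directory holds more than perDir leaf entries plus perDir subdirectories. The scheme, the "d" prefix for intermediate directories, and the names CacheDirLayout and relativePath are all assumptions made for this example; they are not the layout adopted by the fix.

```java
// Hypothetical layout: groups ids into buckets of perDir and encodes the bucket
// number as a chain of "d<digit>" directories (base-perDir digits, most significant first).
class CacheDirLayout {
    /** Returns the relative path for the given id, e.g. "d1/d0/16" for id 16, perDir 4. */
    static String relativePath(long id, int perDir) {
        if (id < 0 || perDir < 2) {
            throw new IllegalArgumentException("id must be >= 0 and perDir >= 2");
        }
        // The leaf directory is named after the id itself.
        StringBuilder path = new StringBuilder(Long.toString(id));
        // Bucket 0 lives directly under the root; other buckets get one level per digit.
        for (long bucket = id / perDir; bucket > 0; bucket /= perDir) {
            path.insert(0, "d" + (bucket % perDir) + "/");
        }
        return path.toString();
    }

    public static void main(String[] args) {
        for (long id : new long[] {3, 5, 16, 20}) {
            System.out.println(id + " -> " + relativePath(id, 4));
        }
    }
}
```

Because intermediate directories carry a prefix and leaf names are plain numbers, a leaf can never collide with a subdirectory of the same parent.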
- YARN-460.
Blocker bug reported by Thomas Graves and fixed by Thomas Graves (capacityscheduler)
CS user left in list of active users for the queue even when application finished
We have seen a user get left in the queue's list of active users even though the application was removed. This can cause everyone else in the queue to get fewer resources if using the minimum user limit percent config.
- YARN-458.
Major bug reported by Sandy Ryza and fixed by Sandy Ryza (nodemanager , resourcemanager)
YARN daemon addresses must be placed in many different configs
The YARN resourcemanager's address is included in four different configs: yarn.resourcemanager.scheduler.address, yarn.resourcemanager.resource-tracker.address, yarn.resourcemanager.address, and yarn.resourcemanager.admin.address
A new user trying to configure a cluster needs to know the names of all these four configs.
The same issue exists for nodemanagers.
It would be much easier if they could simply specify yarn.resourcemanager.hostname and yarn.nodemanager.hostname and default ports for the other ones would kick in.
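The proposal can be pictured as deriving the four ResourceManager addresses from a single hostname. The sketch below assumes the default ports 8032, 8030, 8031 and 8033; these should be checked against yarn-default.xml for the release in use. RmAddressDefaults and derive are hypothetical names, not part of YARN.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: expands one hostname into the four RM address settings.
class RmAddressDefaults {
    /** Returns config key -> "host:port" using assumed default ports. */
    static Map<String, String> derive(String hostname) {
        Map<String, String> conf = new LinkedHashMap<>();
        // Port numbers are assumptions; verify them against yarn-default.xml.
        conf.put("yarn.resourcemanager.address", hostname + ":8032");
        conf.put("yarn.resourcemanager.scheduler.address", hostname + ":8030");
        conf.put("yarn.resourcemanager.resource-tracker.address", hostname + ":8031");
        conf.put("yarn.resourcemanager.admin.address", hostname + ":8033");
        return conf;
    }

    public static void main(String[] args) {
        derive("rm.example.com").forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```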
- YARN-450.
Major sub-task reported by Bikas Saha and fixed by Zhijie Shen
Define value for * in the scheduling protocol
The ResourceRequest has a string field to specify node/rack locations. For the cross-rack/cluster-wide location (i.e., when there is no locality constraint) the "*" string is used everywhere. However, it's not defined anywhere and each piece of code either defines a local constant or uses the string literal. Defining "*" in the protocol and removing other local references from the code base will be good.
- YARN-448.
Major bug reported by Kihwal Lee and fixed by Kihwal Lee (nodemanager)
Remove unnecessary hflush from log aggregation
AggregatedLogFormat#writeVersion() calls hflush() after writing the version. Calling hflush does not seem to be necessary. It can add a lot of load to hdfs in a big busy cluster.
- YARN-447.
Minor improvement reported by nemon lou and fixed by nemon lou (scheduler)
applicationComparator improvement for CS
The compare code is currently:
return a1.getApplicationId().getId() - a2.getApplicationId().getId();
It will be replaced with:
return a1.getApplicationId().compareTo(a2.getApplicationId());
This brings some benefits:
1. It leaves the applicationId comparison logic to the ApplicationId class.
2. In a future HA mode the cluster timestamp may change; the ApplicationId class already takes care of this condition.
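The difference between the two comparisons can be seen in a small sketch. AppId below is a hypothetical stand-in for ApplicationId, assumed to order by cluster timestamp first and sequence number second; it is not the Hadoop class itself.

```java
import java.util.Comparator;

class AppIdCompare {
    /** Hypothetical stand-in for ApplicationId: cluster timestamp plus sequence number. */
    static final class AppId implements Comparable<AppId> {
        final long clusterTimestamp;
        final int id;

        AppId(long clusterTimestamp, int id) {
            this.clusterTimestamp = clusterTimestamp;
            this.id = id;
        }

        @Override
        public int compareTo(AppId other) {
            // Order by cluster timestamp first, then by sequence number.
            int byCluster = Long.compare(clusterTimestamp, other.clusterTimestamp);
            return byCluster != 0 ? byCluster : Integer.compare(id, other.id);
        }
    }

    // Old style: looks only at the sequence number and ignores the cluster timestamp.
    static final Comparator<AppId> BY_ID_ONLY = (a, b) -> a.id - b.id;

    // New style: delegates the whole comparison to the id class.
    static final Comparator<AppId> BY_APP_ID = Comparator.naturalOrder();

    public static void main(String[] args) {
        AppId older = new AppId(1000L, 7); // submitted before a restart
        AppId newer = new AppId(2000L, 3); // submitted after a restart
        System.out.println(BY_ID_ONLY.compare(older, newer) > 0);
        System.out.println(BY_APP_ID.compare(older, newer) < 0);
    }
}
```

With the id-only comparator the application from the earlier cluster start sorts after the later one, which is the situation the second benefit refers to.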
- YARN-444.
Major sub-task reported by Sandy Ryza and fixed by Sandy Ryza (api , applications/distributed-shell)
Move special container exit codes from YarnConfiguration to API
YarnConfiguration currently contains the special container exit codes INVALID_CONTAINER_EXIT_STATUS = -1000, ABORTED_CONTAINER_EXIT_STATUS = -100, and DISKS_FAILED = -101.
These are not really related to configuration, and YarnConfiguration should not become a place to put miscellaneous constants.
Per discussion on YARN-417, appmaster writers need to be able to provide special handling for them, so it might make sense to move these to their own user-facing class.
- YARN-441.
Major sub-task reported by Siddharth Seth and fixed by Xuan Gong
Clean up unused collection methods in various APIs
There's a bunch of unused methods like getAskCount() and getAsk(index) in AllocateRequest, and other interfaces. These should be removed.
In YARN, they were found in the classes below. MR will have its own set.
AllocateRequest
StartContainerResponse
- YARN-440.
Major sub-task reported by Siddharth Seth and fixed by Xuan Gong
Flatten RegisterNodeManagerResponse
RegisterNodeManagerResponse has another wrapper RegistrationResponse under it, which can be removed.
- YARN-439.
Major sub-task reported by Siddharth Seth and fixed by Xuan Gong
Flatten NodeHeartbeatResponse
NodeHeartbeatResponse has another wrapper HeartbeatResponse under it, which can be removed.
- YARN-426.
Critical bug reported by Jason Lowe and fixed by Jason Lowe (nodemanager)
Failure to download a public resource on a node prevents further downloads of the resource from that node
If the NM encounters an error while downloading a public resource, it fails to empty the list of request events corresponding to the resource request in {{attempts}}. If the same public resource is subsequently requested on that node, {{PublicLocalizer.addResource}} will skip the download since it will mistakenly believe a download of that resource is already in progress. At that point any container that requests the public resource will just hang in the {{LOCALIZING}} state.
- YARN-422.
Major sub-task reported by Bikas Saha and fixed by Zhijie Shen
Add NM client library
Create a simple wrapper over the ContainerManager protocol to hide the details of the protocol implementation.
- YARN-417.
Major sub-task reported by Sandy Ryza and fixed by Sandy Ryza (api , applications)
Create AMRMClient wrapper that provides asynchronous callbacks
Writing AMs would be easier for some if they did not have to handle heartbeating to the RM on their own.
- YARN-412.
Minor bug reported by Roger Hoover and fixed by Roger Hoover (scheduler)
FifoScheduler incorrectly checking for node locality
In the FifoScheduler, the assignNodeLocalContainers method is checking if the data is local to a node by searching for the nodeAddress of the node in the set of outstanding requests for the app. This seems to be incorrect as it should be checking hostname instead. The offending line of code is 455:
application.getResourceRequest(priority, node.getRMNode().getNodeAddress());
Requests are formatted by hostname (e.g. host1.foo.com) whereas node addresses are a concatenation of hostname and command port (e.g. host1.foo.com:1234)
In the CapacityScheduler, it's done using hostname. See LeafQueue.assignNodeLocalContainers, line 1129
application.getResourceRequest(priority, node.getHostName());
Note that this bug does not affect the actual scheduling decisions made by the FifoScheduler because even though it incorrectly determines that a request is not local to the node, it will still schedule the request immediately because it's rack-local. However, this bug may be adversely affecting the reporting of job status by underreporting the number of tasks that were node local.
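The mismatch can be reproduced with a plain map keyed by hostname, which is how the report describes the outstanding requests. LocalityLookup and hasNodeLocalRequest are hypothetical names used only for this illustration; the real scheduler data structures are more involved.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: outstanding container counts keyed by requested location.
class LocalityLookup {
    /** Returns true if there is at least one outstanding request for the location. */
    static boolean hasNodeLocalRequest(Map<String, Integer> requestsByLocation, String location) {
        Integer count = requestsByLocation.get(location);
        return count != null && count > 0;
    }

    public static void main(String[] args) {
        Map<String, Integer> requests = new HashMap<>();
        requests.put("host1.foo.com", 2); // requests are keyed by hostname

        // Looking up by node address (hostname:port) never matches.
        System.out.println(hasNodeLocalRequest(requests, "host1.foo.com:1234"));
        // Looking up by hostname finds the outstanding requests.
        System.out.println(hasNodeLocalRequest(requests, "host1.foo.com"));
    }
}
```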
- YARN-410.
Major bug reported by Vinod Kumar Vavilapalli and fixed by Omkar Vinit Joshi
New lines in diagnostics for a failed app on the per-application page make it hard to read
We need to fix the following issues on YARN web-UI:
- Remove the "Note" column from the application list. When a failure happens, this "Note" spoils the table layout.
- When the Application is still not running, the Tracking UI should be titled "UNASSIGNED". For some reason it is titled "ApplicationMaster" but (correctly) links to "#".
- The per-application page has all the RM related information like version, start-time etc. Must be some accidental change by one of the patches.
- The diagnostics for a failed app on the per-application page don't retain new lines and wrap'em around - looks hard to read.
- YARN-406.
Minor improvement reported by Hitesh Shah and fixed by Hitesh Shah
TestRackResolver fails when local network resolves "host1" to a valid host
- YARN-400.
Critical bug reported by Jason Lowe and fixed by Jason Lowe (resourcemanager)
RM can return null application resource usage report leading to NPE in client
RMAppImpl.createAndGetApplicationReport can return a report with a null resource usage report if full access to the app is allowed but the application has no current attempt. This leads to NPEs in client code that assumes an app report will always have at least an empty resource usage report.
- YARN-398.
Major sub-task reported by Arun C Murthy and fixed by Arun C Murthy
Enhance CS to allow for white-list of resources
Allow white-list and black-list of resources in scheduler api.
- YARN-396.
Major sub-task reported by Bikas Saha and fixed by Zhijie Shen
Rationalize AllocateResponse in RM scheduler API
AllocateResponse contains an AMResponse and cluster node count. AMResponse has more data. Unless there is a good reason for this object structure, there should be either AMResponse or AllocateResponse.
- YARN-392.
Major sub-task reported by Bikas Saha and fixed by Sandy Ryza (resourcemanager)
Make it possible to specify hard locality constraints in resource requests
Currently it's not possible to specify scheduling requests for specific nodes and nowhere else. The RM automatically relaxes locality to rack and * and assigns non-specified machines to the app.
- YARN-391.
Trivial improvement reported by Steve Loughran and fixed by Steve Loughran (nodemanager)
detabify LCEResourcesHandler classes
The LCEResourcesHandler classes from YARN-3 have had some tab chars that have snuck into the source tree. Fix this before that code starts getting branched off and it's too late.
- YARN-390.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (client)
ApplicationCLI and NodeCLI use hard-coded platform-specific line separator, which causes test failures on Windows
{{ApplicationCLI}}, {{NodeCLI}}, and the corresponding test {{TestYarnCLI}} all use a hard-coded '\n' as the line separator. This causes test failures on Windows.
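A platform-neutral alternative is to take the separator from the JDK instead of hard-coding it. The sketch below uses System.lineSeparator(); NodeReportPrinter and its output fields are hypothetical and only loosely modelled on the CLI output, not copied from the actual fix.

```java
// Hypothetical formatter: builds multi-line output with the platform line separator.
class NodeReportPrinter {
    /** Formats two report fields, each terminated by the platform line separator. */
    static String format(String nodeId, String state) {
        String nl = System.lineSeparator(); // "\r\n" on Windows, "\n" on Unix-like systems
        return "Node-Id : " + nodeId + nl
             + "Node-State : " + state + nl;
    }

    public static void main(String[] args) {
        System.out.print(format("foo.com:8041", "RUNNING"));
    }
}
```

Tests that build their expected strings with the same separator then pass on either platform.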
- YARN-387.
Blocker sub-task reported by Vinod Kumar Vavilapalli and fixed by Vinod Kumar Vavilapalli
Fix inconsistent protocol naming
We now have different and inconsistent naming schemes for various protocols. It was hard to explain to users, mainly in direct interactions at talks/presentations and user group meetings, with such naming.
We should fix these before we go beta.
- YARN-385.
Major improvement reported by Sandy Ryza and fixed by Sandy Ryza (api)
ResourceRequestPBImpl's toString() is missing location and # containers
ResourceRequestPBImpl's toString method includes priority and resource capability, but omits location and number of containers.
- YARN-383.
Minor bug reported by Hitesh Shah and fixed by Hitesh Shah
AMRMClientImpl should handle null rmClient in stop()
2013-02-06 09:31:33,813 INFO [Thread-2] service.CompositeService (CompositeService.java:stop(101)) - Error stopping org.apache.hadoop.yarn.client.AMRMClientImpl
org.apache.hadoop.HadoopIllegalArgumentException: Cannot close proxy since it is null
at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:605)
at org.apache.hadoop.yarn.client.AMRMClientImpl.stop(AMRMClientImpl.java:150)
at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
- YARN-382.
Major improvement reported by Thomas Graves and fixed by Zhijie Shen (scheduler)
SchedulerUtils improve way normalizeRequest sets the resource capabilities
In YARN-370, we changed it from setting the capability to directly setting memory and cores:
- ask.setCapability(normalized);
+ ask.getCapability().setMemory(normalized.getMemory());
+ ask.getCapability().setVirtualCores(normalized.getVirtualCores());
We did this because it is directly setting the values in the original resource object passed in when the AM gets allocated and without it the AM doesn't get the resource normalized correctly in the submission context. See YARN-370 for more details.
I think we should find a better way of doing this long term: first, so we don't have to keep adding things there when new resources are added; second, because it's a bit confusing as to what it's doing and prone to someone accidentally breaking it again in the future. Something closer to what Arun suggested in YARN-370 would be better but we need to make sure all the places work and get some more testing on it before putting it in.
- YARN-381.
Minor improvement reported by Eli Collins and fixed by Sandy Ryza (documentation)
Improve FS docs
The MR2 FS docs could use some improvements.
Configuration:
- sizebasedweight - what is the "size" here? Total memory usage?
Pool properties:
- minResources - what does min amount of aggregate memory mean given that this is not a reservation?
- maxResources - is this a hard limit?
- weight: How is this ratio configured? Eg base is 1 and all weights are relative to that?
- schedulingMode - what is the default? Is fifo pure fifo, eg waits until all tasks for the job are finished before launching the next job?
There's no mention of ACLs, even though they're supported. See the CS docs for comparison.
Also there are a couple typos worth fixing while we're at it, eg "finish. apps to run"
Worth keeping in mind that some of these will need to be updated to reflect that resource calculators are now pluggable.
- YARN-380.
Major bug reported by Thomas Graves and fixed by Omkar Vinit Joshi (client)
yarn node -status prints Last-Last-Health-Update
I assume the Last-Last-Health-Update is a typo and it should just be Last-Health-Update.
$ yarn node -status foo.com:8041
Node Report :
Node-Id : foo.com:8041
Rack : /10.10.10.0
Node-State : RUNNING
Node-Http-Address : foo.com:8042
Health-Status(isNodeHealthy) : true
Last-Last-Health-Update : 1360118400219
Health-Report :
Containers : 0
Memory-Used : 0M
Memory-Capacity : 24576
- YARN-378.
Major sub-task reported by xieguiming and fixed by Zhijie Shen (client , resourcemanager)
ApplicationMaster retry times should be set by Client
We should allow different clients or users to have different ApplicationMaster retry counts. In other words, "yarn.resourcemanager.am.max-retries" should be settable by the client.
- YARN-377.
Minor bug reported by Tsz Wo (Nicholas), SZE and fixed by Chris Nauroth
Fix TestContainersMonitor for HADOOP-9252
HADOOP-9252 slightly changed the format of some StringUtils outputs. It caused TestContainersMonitor to fail.
Also, some methods were deprecated by HADOOP-9252. The use of them should be replaced with the new methods.
- YARN-376.
Blocker bug reported by Jason Lowe and fixed by Jason Lowe (resourcemanager)
Apps that have completed can appear as RUNNING on the NM UI
On a busy cluster we've noticed a growing number of applications appear as RUNNING on a nodemanager's web pages even though the applications have long since finished. Looking at the NM logs, it appears the RM never told the nodemanager that the application had finished. This is also reflected in a jstack of the NM process, since many more log aggregation threads are running than one would expect from the number of actively running applications.
- YARN-369.
Major sub-task reported by Hitesh Shah and fixed by Mayank Bansal (resourcemanager)
Handle ( or throw a proper error when receiving) status updates from application masters that have not registered
Currently, an allocate call from an unregistered application is allowed and the status update for it throws a statemachine error that is silently dropped.
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: STATUS_UPDATE at LAUNCHED
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:588)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:99)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:471)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:452)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
at java.lang.Thread.run(Thread.java:680)
ApplicationMasterService should likely throw an appropriate error for applications' requests that should not be handled in such cases.
- YARN-368.
Trivial bug reported by Albert Chu and fixed by Albert Chu
Fix typo "defiend" should be "defined" in error output
Noticed the following in an error log output while doing some experiments
./1066018/nodes/hyperion987/log/yarn-achu-nodemanager-hyperion987.out:java.lang.RuntimeException: No class defiend for uda.shuffle
"defiend" should be "defined"
- YARN-365.
Major sub-task reported by Siddharth Seth and fixed by Xuan Gong (resourcemanager , scheduler)
Each NM heartbeat should not generate an event for the Scheduler
Follow up from YARN-275
https://issues.apache.org/jira/secure/attachment/12567075/Prototype.txt
- YARN-363.
Major bug reported by Jason Lowe and fixed by Kenji Kikushima
yarn proxyserver fails to find webapps/proxy directory on startup
Starting up the proxy server fails with this error:
{noformat}
2013-01-29 17:37:41,357 FATAL webproxy.WebAppProxy (WebAppProxy.java:start(99)) - Could not start proxy web server
java.io.FileNotFoundException: webapps/proxy not found in CLASSPATH
at org.apache.hadoop.http.HttpServer.getWebAppsPath(HttpServer.java:533)
at org.apache.hadoop.http.HttpServer.<init>(HttpServer.java:225)
at org.apache.hadoop.http.HttpServer.<init>(HttpServer.java:164)
at org.apache.hadoop.yarn.server.webproxy.WebAppProxy.start(WebAppProxy.java:90)
at org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:68)
at org.apache.hadoop.yarn.server.webproxy.WebAppProxyServer.main(WebAppProxyServer.java:94)
{noformat}
- YARN-362.
Minor bug reported by Jason Lowe and fixed by Ravi Prakash
Unexpected extra results when using webUI table search
When using the search box on the web UI to search for a specific task number (e.g.: "0831"), sometimes unexpected extra results are shown. Using the web browser's built-in search-within-page does not show any hits, so these look like completely spurious results.
It looks like the raw timestamp value for time columns, which is not shown in the table, is also being searched with the search box.
- YARN-347.
Major improvement reported by Junping Du and fixed by Junping Du (client)
YARN CLI should show CPU info besides memory info in node status
With YARN-2 checked in, CPU info is taken into consideration in resource scheduling. yarn node -status <NodeID> should show CPU usage and capacity info alongside the memory info.
- YARN-345.
Critical bug reported by Devaraj K and fixed by Robert Parker (nodemanager)
Many InvalidStateTransitonException errors for ApplicationImpl in Node Manager
{code:xml}
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: FINISH_APPLICATION at FINISHED
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:398)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:58)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:520)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:512)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
at java.lang.Thread.run(Thread.java:662)
{code}
{code:xml}
2013-01-17 04:03:46,726 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: FINISH_APPLICATION at APPLICATION_RESOURCES_CLEANINGUP
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:398)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:58)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:520)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:512)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
at java.lang.Thread.run(Thread.java:662)
{code}
{code:xml}
2013-01-17 00:01:11,006 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: FINISH_APPLICATION at FINISHING_CONTAINERS_WAIT
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:398)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:58)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:520)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:512)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
at java.lang.Thread.run(Thread.java:662)
{code}
{code:xml}
2013-01-17 10:56:36,975 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1358385982671_1304_01_000001 transitioned from NEW to DONE
2013-01-17 10:56:36,975 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: APPLICATION_CONTAINER_FINISHED at FINISHED
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:398)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:58)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:520)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:512)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
at java.lang.Thread.run(Thread.java:662)
2013-01-17 10:56:36,975 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1358385982671_1304 transitioned from FINISHED to null
{code}
{code:xml}
2013-01-17 10:56:36,026 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: INIT_CONTAINER at FINISHED
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:398)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:58)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:520)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:512)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
at java.lang.Thread.run(Thread.java:662)
2013-01-17 10:56:36,026 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1358385982671_1304 transitioned from FINISHED to null
{code}
- YARN-333.
Major bug reported by Sandy Ryza and fixed by Sandy Ryza
Schedulers cannot control the queue-name of an application
Currently, if an app is submitted without a queue, RMAppManager sets the RMApp's queue to "default".
A scheduler may wish to make its own decision on which queue to place an app in if none is specified. For example, when the fair scheduler user-as-default-queue config option is set to true, and an app is submitted with no queue specified, the fair scheduler should assign the app to a queue with the user's name.
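The fallback described above can be sketched in Java as follows. This is only an illustration under assumptions: the class name QueuePlacementSketch, the method resolveQueue, and the boolean parameter are hypothetical and are not taken from RMAppManager or the fair scheduler.

```java
// Hypothetical helper; not the actual RMAppManager or fair scheduler code.
final class QueuePlacementSketch {
    static final String DEFAULT_QUEUE = "default";

    /**
     * Picks a queue for a submitted application.
     *
     * @param requestedQueue     queue named at submission, or null/empty if none
     * @param user               name of the submitting user
     * @param userAsDefaultQueue stands in for the user-as-default-queue option
     */
    static String resolveQueue(String requestedQueue, String user,
                               boolean userAsDefaultQueue) {
        if (requestedQueue != null && !requestedQueue.isEmpty()) {
            return requestedQueue; // an explicit choice always wins
        }
        // No queue given: the scheduler, not the app manager, decides.
        return userAsDefaultQueue ? user : DEFAULT_QUEUE;
    }

    public static void main(String[] args) {
        System.out.println(resolveQueue(null, "alice", true));
        System.out.println(resolveQueue(null, "alice", false));
        System.out.println(resolveQueue("research", "alice", true));
    }
}
```

The point of the sketch is only that the default is chosen inside scheduler-owned logic, so each scheduler can apply its own policy.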
- YARN-326.
Major new feature reported by Sandy Ryza and fixed by Sandy Ryza (scheduler)
Add multi-resource scheduling to the fair scheduler
With YARN-2 in, the capacity scheduler has the ability to schedule based on multiple resources, using dominant resource fairness. The fair scheduler should be able to do multiple resource scheduling as well, also using dominant resource fairness.
More details to come on how the corner cases with fair scheduler configs such as min and max resources will be handled.
- YARN-319.
Major bug reported by shenhong and fixed by shenhong (resourcemanager , scheduler)
Submitting a job to a queue that does not allow it in fairScheduler makes the client hang forever.
When the RM uses the fair scheduler and a client submits a job to a queue that does not allow that user to submit jobs, the client hangs forever.
- YARN-309.
Major sub-task reported by Xuan Gong and fixed by Xuan Gong (resourcemanager)
Make RM provide heartbeat interval to NM
- YARN-297.
Major improvement reported by Arun C Murthy and fixed by Xuan Gong
Improve hashCode implementations for PB records
As [~hsn] pointed out in YARN-2, we use very small primes in all our hashCode implementations.
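As a rough illustration of the idea, a hashCode that mixes its fields with a larger prime multiplier might look like the sketch below. The record type RecordIdSketch and the constants 1000003 and 7919 are assumptions chosen for the example; they are not the values or classes used in the actual patch.

```java
// Hypothetical record type; it only illustrates using a larger prime in hashCode().
final class RecordIdSketch {
    private final long clusterTimestamp;
    private final int id;

    RecordIdSketch(long clusterTimestamp, int id) {
        this.clusterTimestamp = clusterTimestamp;
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof RecordIdSketch)) {
            return false;
        }
        RecordIdSketch other = (RecordIdSketch) o;
        return clusterTimestamp == other.clusterTimestamp && id == other.id;
    }

    @Override
    public int hashCode() {
        final int prime = 1000003; // illustrative larger prime (an assumption)
        int result = 7919;         // illustrative prime seed (an assumption)
        result = prime * result + Long.hashCode(clusterTimestamp);
        result = prime * result + id;
        return result;
    }

    public static void main(String[] args) {
        System.out.println(new RecordIdSketch(1L, 2).hashCode());
        System.out.println(new RecordIdSketch(2L, 1).hashCode());
    }
}
```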
- YARN-295.
Major sub-task reported by Devaraj K and fixed by Mayank Bansal (resourcemanager)
Resource Manager throws InvalidStateTransitonException: Invalid event: CONTAINER_FINISHED at ALLOCATED for RMAppAttemptImpl
{code:xml}
2012-12-28 14:03:56,956 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_FINISHED at ALLOCATED
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:490)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:80)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:433)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:414)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
at java.lang.Thread.run(Thread.java:662)
{code}
- YARN-289.
Major bug reported by Sandy Ryza and fixed by Sandy Ryza
Fair scheduler allows reservations that won't fit on node
An application requests a container with 1024 MB. It then requests a container with 2048 MB. A node shows up with 1024 MB available. Even if the application is the only one running, neither request will be scheduled on it.
- YARN-269.
Major bug reported by Thomas Graves and fixed by Jason Lowe (resourcemanager)
Resource Manager not logging the health_check_script result when taking it out
The Resource Manager does not log the health_check_script result when taking a node out of service. This was added to the jobtracker in 1.x with MAPREDUCE-2451; we should do the same thing for the RM.
- YARN-249.
Major improvement reported by Ravi Prakash and fixed by Ravi Prakash (capacityscheduler)
Capacity Scheduler web page should show list of active users per queue like it used to (in 1.x)
On the jobtracker, the web UI showed the active users for each queue and how many resources each of those users was using. That currently isn't displayed on the RM capacity scheduler web UI.
- YARN-237.
Major improvement reported by Ravi Prakash and fixed by Jian He (resourcemanager)
Refreshing the RM page forgets how many rows I had in my Datatables
If I choose 100 rows and then refresh the page, DataTables goes back to showing me 20 rows.
This user preference should be stored in a cookie.
- YARN-236.
Major bug reported by Jason Lowe and fixed by Jason Lowe (resourcemanager)
RM should point tracking URL to RM web page when app fails to start
Similar to YARN-165, the RM should redirect the tracking URL to the specific app page on the RM web UI when the application fails to start. For example, if the AM completely fails to start due to bad AM config or bad job config like invalid queuename, then the user gets the unhelpful "The requested application exited before setting a tracking URL".
Usually the diagnostic string on the RM app page has something useful, so we might as well point there.
- YARN-227.
Major bug reported by Jason Lowe and fixed by Jason Lowe (resourcemanager)
Application expiration difficult to debug for end-users
When an AM attempt expires the AMLivelinessMonitor in the RM will kill the job and mark it as failed. However there are no diagnostic messages set for the application indicating that the application failed because of expiration. Even if the AM logs are examined, it's often not obvious that the application was externally killed. The only evidence of what happened to the application is currently in the RM logs, and those are often not accessible by users.
- YARN-209.
Major bug reported by Bikas Saha and fixed by Zhijie Shen (capacityscheduler)
Capacity scheduler doesn't trigger app-activation after adding nodes
Say application A is submitted, but at that time it does not meet the bar for activation because of the resource limit settings for applications. If more hardware is then added to the system and the application becomes valid, it still remains in the pending state, likely forever.
This might be rare to hit in real life because enough NMs heartbeat to the RM before applications can get submitted. But a change in settings or heartbeat interval might make it easier to repro. In RM restart scenarios, this will likely hit more often if restart is implemented by replaying events and resubmitting applications to the scheduler before the RPC to the NMs is activated.
- YARN-200.
Major sub-task reported by Robert Joseph Evans and fixed by Ravi Prakash
yarn log does not output all needed information, and is in a binary format
yarn logs does not output attemptid, nodename, or container-id. Missing these makes it very difficult to look through the logs for failed containers and tie them back to actual tasks and task attempts.
Also, the output currently includes several binary characters. This is OK for machine readability, but it makes the output hard for humans to read, or even to process with standard tools like grep.
The help message could also be more useful to users.
- YARN-198.
Minor improvement reported by Ramgopal N and fixed by Jian He (nodemanager)
If we navigate to the Nodemanager UI from the Resourcemanager, there is no link to navigate back to the Resourcemanager
If we navigate to the Nodemanager by clicking on the node link in the RM, there is no link on the NM to navigate back to the RM.
It would be good to have a link to navigate back to the RM.
- YARN-196.
Major bug reported by Ramgopal N and fixed by Xuan Gong (nodemanager)
Nodemanager should be more robust in handling connection failure to ResourceManager when a cluster is started
If the NM is started before the RM, the NM shuts down with the following error:
{code}
ERROR org.apache.hadoop.yarn.service.CompositeService: Error starting services org.apache.hadoop.yarn.server.nodemanager.NodeManager
org.apache.avro.AvroRuntimeException: java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:149)
at org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:68)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.start(NodeManager.java:167)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:242)
Caused by: java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:66)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:182)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:145)
... 3 more
Caused by: com.google.protobuf.ServiceException: java.net.ConnectException: Call From HOST-10-18-52-230/10.18.52.230 to HOST-10-18-52-250:8025 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:131)
at $Proxy23.registerNodeManager(Unknown Source)
at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:59)
... 5 more
Caused by: java.net.ConnectException: Call From HOST-10-18-52-230/10.18.52.230 to HOST-10-18-52-250:8025 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:857)
at org.apache.hadoop.ipc.Client.call(Client.java:1141)
at org.apache.hadoop.ipc.Client.call(Client.java:1100)
at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:128)
... 7 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:659)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:469)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:563)
at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:211)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1247)
at org.apache.hadoop.ipc.Client.call(Client.java:1117)
... 9 more
2012-01-16 15:04:13,336 WARN org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher thread interrupted
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1899)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1934)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:76)
at java.lang.Thread.run(Thread.java:619)
2012-01-16 15:04:13,337 INFO org.apache.hadoop.yarn.service.AbstractService: Service:Dispatcher is stopped.
2012-01-16 15:04:13,392 INFO org.mortbay.log: Stopped SelectChannelConnector@0.0.0.0:9999
2012-01-16 15:04:13,493 INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer is stopped.
2012-01-16 15:04:13,493 INFO org.apache.hadoop.ipc.Server: Stopping server on 24290
2012-01-16 15:04:13,494 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 24290
2012-01-16 15:04:13,495 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2012-01-16 15:04:13,496 INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler is stopped.
2012-01-16 15:04:13,496 WARN org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher thread interrupted
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1899)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1934)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:76)
at java.lang.Thread.run(Thread.java:619)
{code}
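One way to make startup tolerant of a ResourceManager that is not yet reachable is a bounded retry loop around registration. The sketch below is an assumption-laden illustration, not the NodeStatusUpdaterImpl fix: the class RegisterRetrySketch, the Attempt interface, and the maxAttempts and waitMillis parameters are all hypothetical.

```java
import java.io.IOException;

// Hypothetical sketch: none of these names come from NodeStatusUpdaterImpl.
final class RegisterRetrySketch {

    /** One registration attempt; throws IOException if the RM is unreachable. */
    interface Attempt {
        void run() throws IOException;
    }

    /**
     * Tries to register up to maxAttempts times, sleeping waitMillis between
     * failures. Returns true on success, and false if every attempt failed or
     * the thread was interrupted while waiting.
     */
    static boolean registerWithRetry(Attempt attempt, int maxAttempts, long waitMillis) {
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                attempt.run();
                return true;
            } catch (IOException e) {
                System.err.println("Attempt " + i + " of " + maxAttempts
                    + " failed: " + e.getMessage());
            }
            if (i < maxAttempts) {
                try {
                    Thread.sleep(waitMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // keep the interrupt visible to callers
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        boolean registered = registerWithRetry(() -> {
            throw new IOException("Connection refused");
        }, 3, 10L);
        System.out.println("registered = " + registered);
    }
}
```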
- YARN-193.
Major bug reported by Hitesh Shah and fixed by Zhijie Shen (resourcemanager)
Scheduler.normalizeRequest does not account for allocation requests that exceed maximumAllocation limits
- YARN-142.
Blocker task reported by Siddharth Seth and fixed by
[Umbrella] Cleanup YARN APIs w.r.t exceptions
Ref: MAPREDUCE-4067
All YARN APIs currently throw YarnRemoteException.
1) This cannot be extended in its current form.
2) The RPC layer can throw IOExceptions. These end up showing up as UndeclaredThrowableExceptions.
- YARN-125.
Minor sub-task reported by Steve Loughran and fixed by Steve Loughran
Make Yarn Client service shutdown operations robust
Make the yarn client services more robust against being shut down while not started, or shut down more than once, by null-checking fields before closing them and setting them to null afterwards to prevent double-invocation. This is a subset of MAPREDUCE-3502.
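The null-check-then-clear pattern described above can be sketched as follows. The class RobustStopSketch and its single Closeable field are hypothetical stand-ins for a client service and the resources it holds; this is not the patched YARN client code.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical client-like service; not the actual YARN client code.
final class RobustStopSketch {
    private Closeable connection; // stays null until start() succeeds
    private int closeCount;       // counts real close() calls, for demonstration only

    synchronized void start() {
        if (connection == null) {
            connection = () -> closeCount++;
        }
    }

    /** Safe to call before start() and safe to call repeatedly. */
    synchronized void stop() {
        Closeable c = connection;
        connection = null; // cleared first, so a second stop() finds nothing to close
        if (c != null) {
            try {
                c.close();
            } catch (IOException e) {
                System.err.println("Ignoring failure while closing: " + e);
            }
        }
    }

    synchronized int getCloseCount() {
        return closeCount;
    }

    public static void main(String[] args) {
        RobustStopSketch service = new RobustStopSketch();
        service.stop(); // not started yet: nothing happens
        service.start();
        service.stop();
        service.stop(); // second stop is a no-op
        System.out.println("close() calls: " + service.getCloseCount());
    }
}
```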
- YARN-124.
Minor sub-task reported by Steve Loughran and fixed by Steve Loughran
Make Yarn Node Manager services robust against shutdown
Add the nodemanager bits of MAPREDUCE-3502 to shut down the Nodemanager services. This is done by checking that fields are non-null before shutting down or closing them, and setting the fields to null afterwards, to be resilient against re-entrancy.
No tests other than manual review.
- YARN-123.
Minor sub-task reported by Steve Loughran and fixed by Steve Loughran
Make yarn Resource Manager services robust against shutdown
Split MAPREDUCE-3502 patches to make the RM code more resilient to being stopped more than once, or before started.
This depends on MAPREDUCE-4014.
- YARN-117.
Major improvement reported by Steve Loughran and fixed by Steve Loughran
Enhance YARN service model
Having played with the YARN service model, there are some issues that I've identified based on past work and initial use.
This JIRA issue is an overall one to cover the issues, with solutions pushed out to separate JIRAs.
h2. state model prevents stopped state being entered if you could not successfully start the service.
In the current lifecycle you cannot stop a service unless it was successfully started, but
* {{init()}} may acquire resources that need to be explicitly released
* if the {{start()}} operation fails partway through, the {{stop()}} operation may be needed to release resources.
*Fix:* make {{stop()}} a valid state transition from all states and require the implementations to be able to stop safely without requiring all fields to be non null.
Before anyone points out that the {{stop()}} operations assume that all fields are valid; and if called before a {{start()}} they will NPE; MAPREDUCE-3431 shows that this problem arises today, MAPREDUCE-3502 is a fix for this. It is independent of the rest of the issues in this doc but it will aid making {{stop()}} execute from all states other than "stopped".
MAPREDUCE-3502 is too big a patch and needs to be broken down for easier review and take up; this can be done with issues linked to this one.
h2. AbstractService doesn't prevent duplicate state change requests.
The {{ensureState()}} checks to verify whether or not a state transition is allowed from the current state are performed in the base {{AbstractService}} class, yet subclasses tend to call this *after* their own {{init()}}, {{start()}} & {{stop()}} operations. This means that these operations can be performed out of order, and even if the outcome of the call is an exception, all actions performed by the subclasses will have taken place. MAPREDUCE-3877 demonstrates this.
This is a tricky one to address. In HADOOP-3128 I used a base class instead of an interface and made the {{init()}}, {{start()}} & {{stop()}} methods {{final}}. These methods would do the checks, and then invoke protected inner methods, {{innerStart()}}, {{innerStop()}}, etc. It should be possible to retrofit the same behaviour to everything that extends {{AbstractService}}, something that must be done before the class is considered stable (because once the lifecycle methods are declared final, all subclasses that are out of the source tree will need fixing by the respective developers).
h2. AbstractService state change doesn't defend against race conditions.
There are no concurrency locks on the state transitions. Whatever fix for wrong state calls is added should correct this to prevent re-entrancy, such as {{stop()}} being called from two threads.
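A minimal sketch of how the points so far could fit together: final, synchronized lifecycle methods that check the state before delegating to protected inner methods, with stop() accepted from every state. The class LifecycleSketch, its State enum, and the demo subclass CountingService are hypothetical; only the innerStart()/innerStop() naming is borrowed from the text, and this is not the AbstractService implementation.

```java
// Hypothetical base class; not the AbstractService implementation.
abstract class LifecycleSketch {
    enum State { NOTINITED, INITED, STARTED, STOPPED }

    private State state = State.NOTINITED;

    public final synchronized void init() {
        requireState(State.NOTINITED); // checked before any subclass work runs
        innerInit();
        state = State.INITED;
    }

    public final synchronized void start() {
        requireState(State.INITED);
        innerStart();
        state = State.STARTED;
    }

    /** Valid from every state; a repeated stop() is a no-op. */
    public final synchronized void stop() {
        if (state == State.STOPPED) {
            return;
        }
        state = State.STOPPED;
        innerStop();
    }

    public final synchronized State getState() {
        return state;
    }

    private void requireState(State expected) {
        if (state != expected) {
            throw new IllegalStateException("Expected " + expected + " but was " + state);
        }
    }

    protected void innerInit() {}

    protected void innerStart() {}

    /** Implementations must tolerate fields that were never initialized. */
    protected void innerStop() {}
}

// Demo subclass that only counts how often the inner methods run.
final class CountingService extends LifecycleSketch {
    int starts;
    int stops;

    @Override
    protected void innerStart() {
        starts++;
    }

    @Override
    protected void innerStop() {
        stops++;
    }

    public static void main(String[] args) {
        CountingService service = new CountingService();
        service.init();
        service.start();
        service.stop();
        service.stop();
        System.out.println(service.getState() + ", innerStop calls: " + service.stops);
    }
}
```

Because the checks run inside the final methods, a subclass cannot perform its work before the transition has been validated, and the lock prevents two threads from running stop() at once.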
h2. Static methods to choreograph lifecycle operations
Helper methods to move things through lifecycles. init->start is common, stop-if-service!=null another. Some static methods can execute these, and even call {{stop()}} if {{init()}} raises an exception. These could go into a class {{ServiceOps}} in the same package. These can be used by those services that wrap other services, and help manage more robust shutdowns.
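The helpers described above might look roughly like this. The text only proposes a class named {{ServiceOps}}; the class name ServiceOpsSketch, the nested SimpleService interface, and the method names deploy and stopQuietly are assumptions made for this illustration.

```java
// Hypothetical helpers in the spirit of the proposed ServiceOps class.
final class ServiceOpsSketch {

    /** Minimal stand-in for a service with the three lifecycle operations. */
    interface SimpleService {
        void init();
        void start();
        void stop();
    }

    /** Stops the service if it is non-null and swallows any failure from stop(). */
    static void stopQuietly(SimpleService service) {
        if (service == null) {
            return;
        }
        try {
            service.stop();
        } catch (RuntimeException e) {
            System.err.println("Ignoring exception during stop: " + e);
        }
    }

    /** Runs init() then start(); stops the service and rethrows if either fails. */
    static void deploy(SimpleService service) {
        try {
            service.init();
            service.start();
        } catch (RuntimeException e) {
            stopQuietly(service); // release whatever init() or start() acquired
            throw e;
        }
    }

    public static void main(String[] args) {
        stopQuietly(null); // tolerated
        deploy(new SimpleService() {
            public void init() { System.out.println("init"); }
            public void start() { System.out.println("start"); }
            public void stop() { System.out.println("stop"); }
        });
    }
}
```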
h2. state transition failures are something that registered service listeners may wish to be informed of.
When a state transition fails a {{RuntimeException}} can be thrown, and the service listeners are not informed as the notification point isn't reached. They may wish to know this, especially for management and diagnostics.
*Fix:* extend {{ServiceStateChangeListener}} with a callback such as {{stateChangeFailed(Service service,Service.State targeted-state, RuntimeException e)}} that is invoked from the (final) state change methods in the {{AbstractService}} class (once they delegate to their inner {{innerStart()}}, {{innerStop()}} methods); make it a no-op on the existing implementations of the interface.
h2. Service listener failures not handled
Is this an error or not? Log and ignore may not be what is desired.
*Proposed:* during {{stop()}} any exception by a listener is caught and discarded, to increase the likelihood of a better shutdown, but do not add try-catch clauses to the other state changes.
h2. Support static listeners for all AbstractServices
Add support to {{AbstractService}} that allows callers to register listeners for all instances. The existing listener interface could be used. This allows management tools to hook into the events.
The static listeners would be invoked for all state changes except creation (base class shouldn't be handing out references to itself at this point).
These static events could all be async, pushed through a shared {{ConcurrentLinkedQueue}}; failures logged at warn and the rest of the listeners invoked.
h2. Add some example listeners for management/diagnostics
* event to commons log for humans.
* events for machines hooked up to the JSON logger.
* for testing: something that can be told to fail.
h2. Services should support signal interruptibility
The services would benefit from a way of shutting them down on a kill signal; this can be done via a runtime hook. It should not be automatic though, as composite services will get into a very complex state during shutdown. Better to provide a hook that lets you register/unregister services to terminate, and have the relevant {{main()}} entry points tell their root services to register themselves.
- YARN-112.
Major sub-task reported by Jason Lowe and fixed by Omkar Vinit Joshi (nodemanager)
Race in localization can cause containers to fail
On one of our 0.23 clusters, I saw a case of two containers, corresponding to two map tasks of a MR job, that were launched almost simultaneously on the same node. It appears they both tried to localize job.jar and job.xml at the same time. One of the containers failed when it couldn't rename the temporary job.jar directory to its final name because the target directory wasn't empty. Shortly afterwards the second container failed because job.xml could not be found, presumably because the first container removed it when it cleaned up.
- YARN-109.
Major bug reported by Jason Lowe and fixed by Mayank Bansal (nodemanager)
.tmp file is not deleted for localized archives
When archives are localized they are initially created as a .tmp file and unpacked from that file. However the .tmp file is not deleted afterwards.
- YARN-101.
Minor bug reported by xieguiming and fixed by Xuan Gong (nodemanager)
If a heartbeat message is lost, the node status info of completed containers is lost too.
see the red color:
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.java
protected void startStatusUpdater() {
  new Thread("Node Status Updater") {
    @Override
    @SuppressWarnings("unchecked")
    public void run() {
      int lastHeartBeatID = 0;
      while (!isStopped) {
        // Send heartbeat
        try {
          synchronized (heartbeatMonitor) {
            heartbeatMonitor.wait(heartBeatInterval);
          }
{color:red}
          // Before we send the heartbeat, we get the NodeStatus,
          // whose method removes completed containers.
          NodeStatus nodeStatus = getNodeStatus();
{color}
          nodeStatus.setResponseId(lastHeartBeatID);
          NodeHeartbeatRequest request = recordFactory
              .newRecordInstance(NodeHeartbeatRequest.class);
          request.setNodeStatus(nodeStatus);
{color:red}
          // But if the nodeHeartbeat fails, we've already removed the completed
          // containers, so the RM never learns about them. We aren't handling
          // the nodeHeartbeat failure case here.
          HeartbeatResponse response =
              resourceTracker.nodeHeartbeat(request).getHeartbeatResponse();
{color}
          if (response.getNodeAction() == NodeAction.SHUTDOWN) {
            LOG
                .info("Recieved SHUTDOWN signal from Resourcemanager as part of heartbeat," +
                    " hence shutting down.");
            NodeStatusUpdaterImpl.this.stop();
            break;
          }
          if (response.getNodeAction() == NodeAction.REBOOT) {
            LOG.info("Node is out of sync with ResourceManager,"
                + " hence rebooting.");
            NodeStatusUpdaterImpl.this.reboot();
            break;
          }
          lastHeartBeatID = response.getResponseId();
          List<ContainerId> containersToCleanup = response
              .getContainersToCleanupList();
          if (containersToCleanup.size() != 0) {
            dispatcher.getEventHandler().handle(
                new CMgrCompletedContainersEvent(containersToCleanup));
          }
          List<ApplicationId> appsToCleanup =
              response.getApplicationsToCleanupList();
          //Only start tracking for keepAlive on FINISH_APP
          trackAppsForKeepAlive(appsToCleanup);
          if (appsToCleanup.size() != 0) {
            dispatcher.getEventHandler().handle(
                new CMgrCompletedAppsEvent(appsToCleanup));
          }
        } catch (Throwable e) {
          // TODO Better error handling. Thread can die with the rest of the
          // NM still running.
          LOG.error("Caught exception in status-updater", e);
        }
      }
    }
  }.start();
}
private NodeStatus getNodeStatus() {
  NodeStatus nodeStatus = recordFactory.newRecordInstance(NodeStatus.class);
  nodeStatus.setNodeId(this.nodeId);
  int numActiveContainers = 0;
  List<ContainerStatus> containersStatuses = new ArrayList<ContainerStatus>();
  for (Iterator<Entry<ContainerId, Container>> i =
      this.context.getContainers().entrySet().iterator(); i.hasNext();) {
    Entry<ContainerId, Container> e = i.next();
    ContainerId containerId = e.getKey();
    Container container = e.getValue();
    // Clone the container to send it to the RM
    org.apache.hadoop.yarn.api.records.ContainerStatus containerStatus =
        container.cloneAndGetContainerStatus();
    containersStatuses.add(containerStatus);
    ++numActiveContainers;
    LOG.info("Sending out status for container: " + containerStatus);
{color:red}
    // Here is the part that removes the completed containers.
    if (containerStatus.getState() == ContainerState.COMPLETE) {
      // Remove
      i.remove();
{color}
      LOG.info("Removed completed container " + containerId);
    }
  }
  nodeStatus.setContainersStatuses(containersStatuses);
  LOG.debug(this.nodeId + " sending out status for "
      + numActiveContainers + " containers");
  NodeHealthStatus nodeHealthStatus = this.context.getNodeHealthStatus();
  nodeHealthStatus.setHealthReport(healthChecker.getHealthReport());
  nodeHealthStatus.setIsNodeHealthy(healthChecker.isHealthy());
  nodeHealthStatus.setLastHealthReportTime(
      healthChecker.getLastHealthReportTime());
  if (LOG.isDebugEnabled()) {
    LOG.debug("Node's health-status : " + nodeHealthStatus.getIsNodeHealthy()
        + ", " + nodeHealthStatus.getHealthReport());
  }
  nodeStatus.setNodeHealthStatus(nodeHealthStatus);
  List<ApplicationId> keepAliveAppIds = createKeepAliveApplicationList();
  nodeStatus.setKeepAliveApplications(keepAliveAppIds);
  return nodeStatus;
}
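One possible way to avoid the loss highlighted above is to remove completed containers only after the heartbeat that reported them has been delivered. The sketch below illustrates that idea under assumptions; it is not the committed fix. The class PendingStatusSketch, the string-based container states, and the Predicate used as a stand-in for the RPC are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch; it models container statuses as plain strings.
final class PendingStatusSketch {
    private final Map<String, String> containers = new LinkedHashMap<>();

    synchronized void report(String containerId, String state) {
        containers.put(containerId, state);
    }

    /**
     * Sends a snapshot of all statuses through sender, which returns true when
     * the heartbeat was delivered. Completed containers are dropped only after
     * a successful delivery, so a lost heartbeat is simply retried next time.
     */
    synchronized boolean heartbeat(Predicate<Map<String, String>> sender) {
        Map<String, String> snapshot = new LinkedHashMap<>(containers);
        boolean delivered;
        try {
            delivered = sender.test(snapshot);
        } catch (RuntimeException e) {
            delivered = false; // treat an RPC failure as "not delivered"
        }
        if (delivered) {
            for (Map.Entry<String, String> e : snapshot.entrySet()) {
                if ("COMPLETE".equals(e.getValue())) {
                    // Two-argument remove: only drop it if the state is unchanged.
                    containers.remove(e.getKey(), e.getValue());
                }
            }
        }
        return delivered;
    }

    synchronized boolean isTracked(String containerId) {
        return containers.containsKey(containerId);
    }

    public static void main(String[] args) {
        PendingStatusSketch nm = new PendingStatusSketch();
        nm.report("container_1", "COMPLETE");
        nm.heartbeat(statuses -> false); // simulated lost heartbeat
        System.out.println("still tracked: " + nm.isTracked("container_1"));
    }
}
```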
- YARN-99.
Major sub-task reported by Devaraj K and fixed by Omkar Vinit Joshi (nodemanager)
Jobs fail during resource localization when private distributed-cache hits unix directory limits
If multiple jobs use the distributed cache with many small files, the directory limit is reached before the cache size limit, and no more directories can be created in the file cache. The jobs start failing with the exception below.
{code:xml}
java.io.IOException: mkdir of /tmp/nm-local-dir/usercache/root/filecache/1701886847734194975 failed
at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:909)
at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143)
at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706)
at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703)
at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2325)
at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
{code}
We should have a mechanism to clean up the cache files when the number of directories crosses a specified limit, similar to the existing cache-size limit.
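A cleanup policy of the kind suggested above, bounded by entry count as well as by total size, could be sketched like this. It is an illustration only and makes no claim about how the issue was actually resolved; the class CountBoundedCacheSketch and its limits are hypothetical, and eviction is simply oldest-first.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical bookkeeping for a local cache; it does not touch the file system.
final class CountBoundedCacheSketch {
    private final int maxEntries;
    private final long maxBytes;
    private final Map<String, Long> entries = new LinkedHashMap<>(); // oldest first
    private long totalBytes;

    CountBoundedCacheSketch(int maxEntries, long maxBytes) {
        this.maxEntries = maxEntries;
        this.maxBytes = maxBytes;
    }

    /** Records a new cache directory and returns the names evicted to make room. */
    synchronized List<String> add(String name, long bytes) {
        Long previous = entries.put(name, bytes);
        totalBytes += bytes - (previous == null ? 0L : previous);

        List<String> evicted = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = entries.entrySet().iterator();
        while ((entries.size() > maxEntries || totalBytes > maxBytes) && it.hasNext()) {
            Map.Entry<String, Long> oldest = it.next();
            if (oldest.getKey().equals(name)) {
                continue; // never evict the entry that was just added
            }
            totalBytes -= oldest.getValue();
            evicted.add(oldest.getKey());
            it.remove();
        }
        return evicted;
    }

    synchronized int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        CountBoundedCacheSketch cache = new CountBoundedCacheSketch(2, 1_000_000L);
        cache.add("dir-a", 10L);
        cache.add("dir-b", 10L);
        System.out.println("evicted: " + cache.add("dir-c", 10L));
    }
}
```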
- YARN-84.
Minor improvement reported by Brandon Li and fixed by Brandon Li
Use Builder to get RPC server in YARN
In HADOOP-8736, a Builder is introduced to replace all the getServer() variants. This JIRA is the change in YARN.
- YARN-71.
Critical bug reported by Vinod Kumar Vavilapalli and fixed by Xuan Gong (nodemanager)
Ensure/confirm that the NodeManager cleans up local-dirs on restart
We have to make sure that NodeManagers clean up their local files on restart.
It may already be working like that, in which case we should have tests validating this.
- YARN-62.
Major sub-task reported by Vinod Kumar Vavilapalli and fixed by Omkar Vinit Joshi
AM should not be able to abuse container tokens for repetitive container launches
Clone of YARN-51.
ApplicationMaster should not be able to store container tokens and use the same set of tokens for repetitive container launches. The possibility of such abuse exists in the current code for a duration of 1d+10mins; we need to fix this.
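One way to picture the requirement is a tracker that accepts each container ID at most once while its token is still valid. The sketch below is an assumption-based illustration and not the NodeManager implementation; the class SingleUseTokenSketch, the method tryLaunch, and the explicit time parameters are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical tracker; it is not the NodeManager's token handling code.
final class SingleUseTokenSketch {
    // containerId -> expiry time (millis) of the token used to launch it
    private final Map<String, Long> launched = new HashMap<>();

    /**
     * Returns true only for the first launch of a container whose token has
     * not expired. Entries are forgotten once their token expires, which is
     * safe because an expired token is rejected anyway.
     */
    synchronized boolean tryLaunch(String containerId, long tokenExpiryMillis,
                                   long nowMillis) {
        launched.values().removeIf(expiry -> expiry < nowMillis);
        if (tokenExpiryMillis < nowMillis) {
            return false; // token already expired
        }
        return launched.putIfAbsent(containerId, tokenExpiryMillis) == null;
    }

    public static void main(String[] args) {
        SingleUseTokenSketch tracker = new SingleUseTokenSketch();
        long now = System.currentTimeMillis();
        System.out.println(tracker.tryLaunch("container_1", now + 60_000L, now));
        System.out.println(tracker.tryLaunch("container_1", now + 60_000L, now));
    }
}
```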
- YARN-45.
Major sub-task reported by Chris Douglas and fixed by Carlo Curino (resourcemanager)
Scheduler feedback to AM to release containers
The ResourceManager strikes a balance between cluster utilization and strict enforcement of resource invariants in the cluster. Individual allocations of containers must be reclaimed, or reserved, to restore the global invariants when cluster load shifts. In some cases, the ApplicationMaster can respond to fluctuations in resource availability without losing the work already completed by that task (MAPREDUCE-4584). Supplying it with this information would be helpful for overall cluster utilization [1]. To this end, we want to establish a protocol for the RM to ask the AM to release containers.
[1] http://research.yahoo.com/files/yl-2012-003.pdf
- YARN-24.
Major bug reported by Jason Lowe and fixed by Sandy Ryza (nodemanager)
Nodemanager fails to start if log aggregation enabled and namenode unavailable
If log aggregation is enabled and the namenode is currently unavailable, the nodemanager fails to start up.
- MAPREDUCE-5421.
Blocker bug reported by Junping Du and fixed by Junping Du (test)
TestNonExistentJob fails due to recent changes in YARN
- MAPREDUCE-5419.
Major bug reported by Robert Parker and fixed by Robert Parker (mrv2)
TestSlive is getting FileNotFound Exception
- MAPREDUCE-5412.
Major bug reported by Jian He and fixed by Jian He
Change MR to use multiple containers API of ContainerManager after YARN-926
- MAPREDUCE-5398.
Major improvement reported by Bikas Saha and fixed by Jian He
MR changes for YARN-513
- MAPREDUCE-5366.
Minor bug reported by Chuan Liu and fixed by Chuan Liu (test)
TestMRAsyncDiskService fails on Windows
- MAPREDUCE-5360.
Minor bug reported by Chuan Liu and fixed by Chuan Liu (test)
TestMRJobClient fails on Windows due to path format
- MAPREDUCE-5359.
Minor bug reported by Chuan Liu and fixed by Chuan Liu
JobHistory should not use File.separator to match timestamp in path
- MAPREDUCE-5357.
Minor bug reported by Chuan Liu and fixed by Chuan Liu
Job staging directory owner checking could fail on Windows
- MAPREDUCE-5355.
Minor bug reported by Chuan Liu and fixed by Chuan Liu
MiniMRYarnCluster with localFs does not work on Windows
- MAPREDUCE-5349.
Minor bug reported by Chuan Liu and fixed by Chuan Liu
TestClusterMapReduceTestCase and TestJobName fail on Windows in branch-2
- MAPREDUCE-5334.
Blocker bug reported by Vinod Kumar Vavilapalli and fixed by Vinod Kumar Vavilapalli
TestContainerLauncherImpl is failing
- MAPREDUCE-5333.
Major test reported by Alejandro Abdelnur and fixed by Wei Yan (mr-am)
Add test that verifies MRAM works correctly when sending requests with non-normalized capabilities
- MAPREDUCE-5328.
Major bug reported by Omkar Vinit Joshi and fixed by Omkar Vinit Joshi
ClientToken should not be set in the environment
- MAPREDUCE-5326.
Blocker bug reported by Arun C Murthy and fixed by Zhijie Shen
Add version to shuffle header
- MAPREDUCE-5325.
Major bug reported by Xuan Gong and fixed by Xuan Gong
ClientRMProtocol.getAllApplications should accept ApplicationType as a parameter---MR changes
- MAPREDUCE-5319.
Major bug reported by yeshavora and fixed by Xuan Gong
Job.xml file does not have the 'user.name' property for Hadoop2
- MAPREDUCE-5315.
Critical bug reported by Mithun Radhakrishnan and fixed by Mithun Radhakrishnan (distcp)
DistCp reports success even on failure.
- MAPREDUCE-5312.
Major bug reported by Alejandro Abdelnur and fixed by Sandy Ryza
TestRMNMInfo is failing
- MAPREDUCE-5310.
Major bug reported by Alejandro Abdelnur and fixed by Alejandro Abdelnur (applicationmaster)
MRAM should not normalize allocation request capabilities
- MAPREDUCE-5308.
Major bug reported by Nathan Roberts and fixed by Nathan Roberts
Shuffling to memory can get out-of-sync when fetching multiple compressed map outputs
- MAPREDUCE-5304.
Blocker sub-task reported by Alejandro Abdelnur and fixed by Karthik Kambatla
mapreduce.Job killTask/failTask/getTaskCompletionEvents methods have incompatible signature changes
- MAPREDUCE-5303.
Major bug reported by Jian He and fixed by Jian He
Changes on MR after moving ProtoBase to package impl.pb on YARN-724
- MAPREDUCE-5301.
Major bug reported by Siddharth Seth and fixed by Siddharth Seth
Update MR code to work with YARN-635 changes
- MAPREDUCE-5300.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
Two function signature changes in filecache.DistributedCache
- MAPREDUCE-5299.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
Mapred API: void setTaskID(TaskAttemptID) is missing in TaskCompletionEvent
- MAPREDUCE-5298.
Major new feature reported by Steve Loughran and fixed by Steve Loughran (applicationmaster)
Move MapReduce services to YARN-117 stricter lifecycle
- MAPREDUCE-5297.
Major bug reported by Jian He and fixed by Jian He
Update MR App since BuilderUtils is moved to yarn-server-common after YARN-748
- MAPREDUCE-5296.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
Mapred API: Function signature change in JobControl
- MAPREDUCE-5291.
Major bug reported by Siddharth Seth and fixed by Zhijie Shen
Change MR App to use updated property names in container-log4j.properties
- MAPREDUCE-5289.
Major bug reported by Vinod Kumar Vavilapalli and fixed by Jian He
Update MR App to use Token directly after YARN-717
- MAPREDUCE-5286.
Major task reported by Siddharth Seth and fixed by Vinod Kumar Vavilapalli
startContainer call should use the ContainerToken instead of Container [YARN-684]
- MAPREDUCE-5285.
Major bug reported by Jian He and fixed by
Update MR App to use immutable ApplicationAttemptID, ContainerID, NodeID after YARN-735
- MAPREDUCE-5283.
Major improvement reported by Sandy Ryza and fixed by Sandy Ryza (applicationmaster , test)
Over 10 different tests have near identical implementations of AppContext
- MAPREDUCE-5282.
Major bug reported by Vinod Kumar Vavilapalli and fixed by Siddharth Seth
Update MR App to use immutable ApplicationID after YARN-716
- MAPREDUCE-5280.
Major sub-task reported by Zhijie Shen and fixed by Mayank Bansal
Mapreduce API: ClusterMetrics incompatibility issues with MR1
- MAPREDUCE-5275.
Major sub-task reported by Zhijie Shen and fixed by Mayank Bansal
Mapreduce API: TokenCache incompatibility issues with MR1
- MAPREDUCE-5274.
Major sub-task reported by Zhijie Shen and fixed by Mayank Bansal
Mapreduce API: String toHex(byte[]) is removed from SecureShuffleUtils
- MAPREDUCE-5273.
Major sub-task reported by Zhijie Shen and fixed by Mayank Bansal
Protected variables are removed from CombineFileRecordReader in both mapred and mapreduce
- MAPREDUCE-5270.
Major bug reported by Jian He and fixed by Jian He
Migrate from using BuilderUtil factory methods to individual record factory method on MapReduce side
- MAPREDUCE-5268.
Major improvement reported by Jason Lowe and fixed by Karthik Kambatla (jobhistoryserver)
Improve history server startup performance
- MAPREDUCE-5263.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
filecache.DistributedCache incompatibility issues with MR1
- MAPREDUCE-5259.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic (test)
TestTaskLog fails on Windows because of path separators mismatch
- MAPREDUCE-5257.
Major bug reported by Jason Lowe and fixed by Omkar Vinit Joshi (mr-am , mrv2)
TestContainerLauncherImpl fails
- MAPREDUCE-5246.
Major improvement reported by Mayank Bansal and fixed by Mayank Bansal
Adding application type to submission context
- MAPREDUCE-5245.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
A number of public static variables are removed from JobConf
- MAPREDUCE-5244.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
Two functions changed their visibility in JobStatus
- MAPREDUCE-5240.
Blocker bug reported by Roman Shaposhnik and fixed by Vinod Kumar Vavilapalli (mrv2)
inside of FileOutputCommitter the initialized Credentials cache appears to be empty
- MAPREDUCE-5239.
Major bug reported by Vinod Kumar Vavilapalli and fixed by Siddharth Seth
Update MR App to reflect YarnRemoteException changes after YARN-634
- MAPREDUCE-5237.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
ClusterStatus incompatibility issues with MR1
- MAPREDUCE-5235.
Major sub-task reported by Zhijie Shen and fixed by Mayank Bansal
mapred.Counters incompatibility issues with MR1
- MAPREDUCE-5234.
Major sub-task reported by Zhijie Shen and fixed by Mayank Bansal
Signature changes for getTaskId of TaskReport in mapred
- MAPREDUCE-5233.
Major sub-task reported by Zhijie Shen and fixed by Mayank Bansal
Functions are changed or removed from Job in jobcontrol
- MAPREDUCE-5231.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
Constructor of DBInputFormat.DBRecordReader in mapred is changed
- MAPREDUCE-5230.
Major sub-task reported by Zhijie Shen and fixed by Mayank Bansal
createFileSplit is removed from NLineInputFormat of mapred
- MAPREDUCE-5229.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
TEMP_DIR_NAME is removed from FileOutputCommitter of mapreduce
- MAPREDUCE-5228.
Major sub-task reported by Zhijie Shen and fixed by Mayank Bansal
Enum Counter is removed from FileInputFormat and FileOutputFormat of both mapred and mapreduce
- MAPREDUCE-5226.
Major bug reported by Xuan Gong and fixed by Xuan Gong
Handle exception related changes in YARN's AMRMProtocol api after YARN-630
- MAPREDUCE-5222.
Major sub-task reported by Karthik Kambatla and fixed by Karthik Kambatla
Fix JobClient incompatibilities with MR1
- MAPREDUCE-5220.
Major sub-task reported by Sandy Ryza and fixed by Zhijie Shen (client)
Mapred API: TaskCompletionEvent incompatibility issues with MR1
- MAPREDUCE-5213.
Minor bug reported by Karthik Kambatla and fixed by Karthik Kambatla
Re-assess TokenCache methods marked @Private
- MAPREDUCE-5212.
Major bug reported by Xuan Gong and fixed by Xuan Gong
Handle exception related changes in YARN's ClientRMProtocol api after YARN-631
- MAPREDUCE-5209.
Minor bug reported by Radim Kolar and fixed by Tsuyoshi OZAWA (mrv2)
ShuffleScheduler log message incorrect
- MAPREDUCE-5208.
Major bug reported by Omkar Vinit Joshi and fixed by Omkar Vinit Joshi
SpillRecord and ShuffleHandler should use SecureIOUtils for reading index file and map output
- MAPREDUCE-5205.
Blocker bug reported by Vinod Kumar Vavilapalli and fixed by Vinod Kumar Vavilapalli
Apps fail in secure cluster setup
- MAPREDUCE-5204.
Major bug reported by Xuan Gong and fixed by Xuan Gong
Handle YarnRemoteException separately from IOException in MR api
- MAPREDUCE-5199.
Blocker sub-task reported by Vinod Kumar Vavilapalli and fixed by Daryn Sharp (security)
AppTokens file can/should be removed
- MAPREDUCE-5194.
Minor task reported by Chris Douglas and fixed by Chris Douglas (task)
Heed interrupts during Fetcher shutdown
- MAPREDUCE-5193.
Major bug reported by Aaron T. Myers and fixed by Andrew Wang (test)
A few MR tests use block sizes which are smaller than the default minimum block size
- MAPREDUCE-5192.
Minor task reported by Chris Douglas and fixed by Chris Douglas (task)
Separate TCE resolution from fetch
- MAPREDUCE-5191.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic
TestQueue#testQueue fails with timeout on Windows
- MAPREDUCE-5187.
Major bug reported by Chuan Liu and fixed by Chuan Liu (mrv2)
Create mapreduce command scripts on Windows
- MAPREDUCE-5184.
Major sub-task reported by Arun C Murthy and fixed by Zhijie Shen (documentation)
Document MR Binary Compatibility vis-a-vis hadoop-1 and hadoop-2
Document MR Binary Compatibility vis-a-vis hadoop-1 and hadoop-2 for end-users.
- MAPREDUCE-5181.
Major bug reported by Siddharth Seth and fixed by Vinod Kumar Vavilapalli (applicationmaster)
RMCommunicator should not use AMToken from the env
- MAPREDUCE-5179.
Major bug reported by Hitesh Shah and fixed by Hitesh Shah
Change TestHSWebServices to do string equal check on hadoop build version similar to YARN-605
- MAPREDUCE-5178.
Major bug reported by Hitesh Shah and fixed by Hitesh Shah
Fix use of BuilderUtils#newApplicationReport as a result of YARN-577.
- MAPREDUCE-5177.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic
Move to common utils FileUtil#setReadable/Writable/Executable and FileUtil#canRead/Write/Execute
- MAPREDUCE-5176.
Major improvement reported by Carlo Curino and fixed by Carlo Curino (mrv2)
Preemptable annotations (to support preemption in MR)
- MAPREDUCE-5175.
Major bug reported by Vinod Kumar Vavilapalli and fixed by Xuan Gong
Update MR App to not set envs that will be set by NMs anyways after YARN-561
- MAPREDUCE-5171.
Major improvement reported by Sandy Ryza and fixed by Sandy Ryza (applicationmaster)
Expose blacklisted nodes from the MR AM REST API
- MAPREDUCE-5167.
Major bug reported by Vinod Kumar Vavilapalli and fixed by Jian He
Update MR App after YARN-562
- MAPREDUCE-5166.
Blocker bug reported by Gunther Hagleitner and fixed by Sandy Ryza
ConcurrentModificationException in LocalJobRunner
- MAPREDUCE-5163.
Major bug reported by Vinod Kumar Vavilapalli and fixed by Xuan Gong
Update MR App after YARN-441
- MAPREDUCE-5159.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
Aggregatewordcount and aggregatewordhist in hadoop-1 examples are not binary compatible with hadoop-2 mapred.lib.aggregate
- MAPREDUCE-5157.
Major sub-task reported by Zhijie Shen and fixed by Zhijie Shen
Sort in hadoop-1 examples is not binary compatible with hadoop-2 mapred.lib
- MAPREDUCE-5156.
Blocker sub-task reported by Zhijie Shen and fixed by Zhijie Shen
Hadoop-examples-1.x.x.jar cannot run on Yarn
- MAPREDUCE-5152.
Major bug reported by Vinod Kumar Vavilapalli and fixed by Vinod Kumar Vavilapalli
MR App is not using Container from RM
- MAPREDUCE-5151.
Major bug reported by Vinod Kumar Vavilapalli and fixed by Sandy Ryza
Update MR App after YARN-444
- MAPREDUCE-5147.
Major bug reported by Robert Parker and fixed by Robert Parker (mrv2)
Maven build should create hadoop-mapreduce-client-app-VERSION.jar directly
- MAPREDUCE-5146.
Minor bug reported by Sangjin Lee and fixed by Sangjin Lee (task)
application classloader may be used too early to load classes
- MAPREDUCE-5145.
Major bug reported by Zhijie Shen and fixed by Zhijie Shen
Change default max-attempts to be more than one for MR jobs as well
- MAPREDUCE-5140.
Major bug reported by Zhijie Shen and fixed by Zhijie Shen
MR part of YARN-514
- MAPREDUCE-5139.
Major bug reported by Vinod Kumar Vavilapalli and fixed by Xuan Gong
Update MR App after YARN-486
- MAPREDUCE-5138.
Major bug reported by Vinod Kumar Vavilapalli and fixed by Omkar Vinit Joshi
Fix LocalDistributedCacheManager after YARN-112
- MAPREDUCE-5137.
Major bug reported by Thomas Graves and fixed by Thomas Graves (applicationmaster)
AM web UI: clicking on Map Task results in 500 error
- MAPREDUCE-5136.
Major bug reported by Amir Sanjar and fixed by Amir Sanjar
TestJobImpl->testJobNoTasks fails with IBM JAVA
- MAPREDUCE-5129.
Minor new feature reported by Billie Rinaldi and fixed by Billie Rinaldi
Add tag info to JH files
- MAPREDUCE-5128.
Major improvement reported by Sandy Ryza and fixed by Sandy Ryza (documentation , jobhistoryserver)
mapred-default.xml is missing a bunch of history server configs
- MAPREDUCE-5113.
Major bug reported by Sandy Ryza and fixed by Sandy Ryza
Streaming input/output types are ignored with java mapper/reducer
- MAPREDUCE-5098.
Major bug reported by Karthik Kambatla and fixed by Karthik Kambatla (contrib/gridmix)
Fix findbugs warnings in gridmix
- MAPREDUCE-5086.
Major bug reported by Jian He and fixed by Jian He
MR app master deletes staging dir when sent a reboot command from the RM
- MAPREDUCE-5079.
Critical improvement reported by Jason Lowe and fixed by Jason Lowe (mr-am)
Recovery should restore task state from job history info directly
- MAPREDUCE-5078.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (client)
TestMRAppMaster fails on Windows due to mismatched path separators
- MAPREDUCE-5077.
Minor bug reported by Karthik Kambatla and fixed by Karthik Kambatla (mrv2)
Cleanup: mapreduce.util.ResourceCalculatorPlugin and related code should be removed
- MAPREDUCE-5075.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (distcp)
DistCp leaks input file handles
- MAPREDUCE-5069.
Minor improvement reported by Sangjin Lee and fixed by (mrv1 , mrv2)
add concrete common implementations of CombineFileInputFormat
- MAPREDUCE-5066.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic
JobTracker should set a timeout when calling into job.end.notification.url
- MAPREDUCE-5065.
Major bug reported by Mithun Radhakrishnan and fixed by Mithun Radhakrishnan (distcp)
DistCp should skip checksum comparisons if block-sizes are different on source/target.
- MAPREDUCE-5062.
Major bug reported by Vinod Kumar Vavilapalli and fixed by Zhijie Shen
MR AM should read max-retries information from the RM
- MAPREDUCE-5060.
Critical bug reported by Robert Joseph Evans and fixed by Robert Joseph Evans
Fetch failures that time out only count against the first map task
- MAPREDUCE-5059.
Major bug reported by Jason Lowe and fixed by Omkar Vinit Joshi (jobhistoryserver , webapps)
Job overview shows average merge time larger than for any reduce attempt
- MAPREDUCE-5043.
Blocker bug reported by Jason Lowe and fixed by Jason Lowe (mr-am)
Fetch failure processing can cause AM event queue to backup and eventually OOM
- MAPREDUCE-5042.
Blocker bug reported by Jason Lowe and fixed by Jason Lowe (mr-am , security)
Reducer unable to fetch for a map task that was recovered
- MAPREDUCE-5033.
Minor improvement reported by Andrew Wang and fixed by Andrew Wang
mapred shell script should respect usage flags (--help -help -h)
- MAPREDUCE-5027.
Major bug reported by Jason Lowe and fixed by Robert Parker
Shuffle does not limit number of outstanding connections
- MAPREDUCE-5015.
Major test reported by Aleksey Gorshkov and fixed by Aleksey Gorshkov
Coverage fix for org.apache.hadoop.mapreduce.tools.CLI
- MAPREDUCE-5013.
Major bug reported by Sandy Ryza and fixed by Sandy Ryza (client)
mapred.JobStatus compatibility: MR2 missing constructors from MR1
- MAPREDUCE-5009.
Critical bug reported by Robert Parker and fixed by Robert Parker (mrv1)
Killing the Task Attempt slated for commit does not clear the value from the Task commitAttempt member
- MAPREDUCE-5008.
Major bug reported by Sandy Ryza and fixed by Sandy Ryza
Merger progress miscounts with respect to EOF_MARKER
- MAPREDUCE-5007.
Major test reported by Aleksey Gorshkov and fixed by Aleksey Gorshkov
fix coverage org.apache.hadoop.mapreduce.v2.hs
- MAPREDUCE-5000.
Critical bug reported by Jason Lowe and fixed by Jason Lowe (mr-am)
TaskImpl.getCounters() can return the counters for the wrong task attempt when task is speculating
- MAPREDUCE-4994.
Major bug reported by Sandy Ryza and fixed by Sandy Ryza (client)
-jt generic command line option does not work
- MAPREDUCE-4992.
Critical bug reported by Robert Parker and fixed by Robert Parker (mr-am)
AM hangs in RecoveryService when recovering tasks with speculative attempts
- MAPREDUCE-4991.
Major test reported by Aleksey Gorshkov and fixed by Aleksey Gorshkov
coverage for gridmix
- MAPREDUCE-4990.
Trivial improvement reported by Karthik Kambatla and fixed by Karthik Kambatla
Construct debug strings conditionally in ShuffleHandler.Shuffle#sendMapOutput()
- MAPREDUCE-4989.
Major improvement reported by Ravi Prakash and fixed by Ravi Prakash (jobhistoryserver , mr-am)
JSONify DataTables input data for Attempts page
- MAPREDUCE-4987.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (distributed-cache , nodemanager)
TestMRJobs#testDistributedCache fails on Windows due to classpath problems and unexpected behavior of symlinks
- MAPREDUCE-4985.
Trivial bug reported by Plamen Jeliazkov and fixed by Plamen Jeliazkov
TestDFSIO supports compression but the usage message doesn't reflect it
- MAPREDUCE-4981.
Minor bug reported by Plamen Jeliazkov and fixed by Plamen Jeliazkov
WordMean, WordMedian, WordStandardDeviation missing from ExamplesDriver
- MAPREDUCE-4974.
Major improvement reported by Arun A K and fixed by Gelesh (mrv1 , mrv2 , performance)
Optimising the LineRecordReader initialize() method
- MAPREDUCE-4972.
Major test reported by Aleksey Gorshkov and fixed by Aleksey Gorshkov
Coverage fixing for org.apache.hadoop.mapreduce.jobhistory
- MAPREDUCE-4951.
Major bug reported by Sandy Ryza and fixed by Sandy Ryza (applicationmaster , mr-am , mrv2)
Container preemption interpreted as task failure
- MAPREDUCE-4942.
Major sub-task reported by Robert Kanter and fixed by Robert Kanter (mrv2)
mapreduce.Job has a bunch of methods that throw InterruptedException so it's incompatible with MR1
- MAPREDUCE-4932.
Major bug reported by Robert Kanter and fixed by Robert Kanter (mrv2)
mapreduce.job#getTaskCompletionEvents incompatible with Hadoop 1
- MAPREDUCE-4927.
Major bug reported by Jason Lowe and fixed by Ashwin Shankar (jobhistoryserver)
Historyserver 500 error due to NPE when accessing specific counters page for failed job
- MAPREDUCE-4898.
Major bug reported by Robert Kanter and fixed by Robert Kanter (mrv2)
FileOutputFormat.checkOutputSpecs and FileOutputFormat.setOutputPath incompatible with MR1
- MAPREDUCE-4896.
Major bug reported by Sandy Ryza and fixed by Sandy Ryza (client , scheduler)
"mapred queue -info" spits out ugly exception when queue does not exist
- MAPREDUCE-4892.
Major bug reported by Bikas Saha and fixed by Bikas Saha
CombineFileInputFormat node input split can be skewed on small clusters
- MAPREDUCE-4885.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (contrib/streaming , test)
Streaming tests have multiple failures on Windows
- MAPREDUCE-4875.
Major test reported by Aleksey Gorshkov and fixed by Aleksey Gorshkov (test)
coverage fixing for org.apache.hadoop.mapred
- MAPREDUCE-4871.
Major bug reported by Jason Lowe and fixed by Jason Lowe (mrv2)
AM uses mapreduce.jobtracker.split.metainfo.maxsize but mapred-default has mapreduce.job.split.metainfo.maxsize
- MAPREDUCE-4846.
Major improvement reported by Sandy Ryza and fixed by Sandy Ryza (client)
Some JobQueueInfo methods are public in MR1 but protected in MR2
- MAPREDUCE-4794.
Major bug reported by Jason Lowe and fixed by Jason Lowe (applicationmaster)
DefaultSpeculator generates error messages on normal shutdown
- MAPREDUCE-4737.
Major bug reported by Daniel Dai and fixed by Arun C Murthy
Hadoop does not close output file / does not call Mapper.cleanup if exception in map
Ensure that the mapreduce APIs are semantically consistent with the mapred API w.r.t. Mapper.cleanup and Reducer.cleanup, in the sense that cleanup is now called even if there is an error. The old mapred API already ensures that Mapper.close and Reducer.close are invoked during error handling. Note that this is an incompatible change; however, end users can override Mapper.run and Reducer.run to get the old (inconsistent) behaviour.
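The behavioural difference can be sketched with a stand-in class. SketchMapper and LegacyRunMapper below are hypothetical and deliberately do not use the Hadoop Mapper signature; they only show cleanup being guaranteed by a finally block, and how overriding run brings back the old behaviour in which an exception skips cleanup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a mapper; not the Hadoop Mapper class.
class SketchMapper {
    final List<String> events = new ArrayList<>();

    void setup() { events.add("setup"); }

    void map(String record) {
        if (record.isEmpty()) {
            throw new IllegalArgumentException("empty record");
        }
        events.add("map:" + record);
    }

    void cleanup() { events.add("cleanup"); }

    // Behaviour described in the note: cleanup runs even when map() throws.
    void run(List<String> records) {
        setup();
        try {
            for (String record : records) {
                map(record);
            }
        } finally {
            cleanup();
        }
    }
}

// Overriding run() restores the old behaviour: an exception skips cleanup.
class LegacyRunMapper extends SketchMapper {
    @Override
    void run(List<String> records) {
        setup();
        for (String record : records) {
            map(record);
        }
        cleanup();
    }

    public static void main(String[] args) {
        for (SketchMapper mapper : Arrays.asList(new SketchMapper(), new LegacyRunMapper())) {
            try {
                mapper.run(Arrays.asList("a", ""));
            } catch (IllegalArgumentException e) {
                // The failure itself is expected; only the recorded events differ.
            }
            System.out.println(mapper.getClass().getSimpleName() + " -> " + mapper.events);
        }
    }
}
```

Code that depended on cleanup being skipped after a failure is what makes the change incompatible.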
- MAPREDUCE-4716.
Major bug reported by Thomas Graves and fixed by Thomas Graves (jobhistoryserver)
TestHsWebServicesJobsQuery.testJobsQueryStateInvalid fails with jdk7
- MAPREDUCE-4693.
Major bug reported by Jason Lowe and fixed by Xuan Gong (jobhistoryserver , mrv2)
Historyserver should provide counters for failed tasks
- MAPREDUCE-4671.
Major bug reported by Bikas Saha and fixed by Bikas Saha
AM does not tell the RM about container requests that are no longer needed
- MAPREDUCE-4571.
Major bug reported by Thomas Graves and fixed by Thomas Graves (webapps)
TestHsWebServicesJobs fails on jdk7
- MAPREDUCE-4374.
Minor bug reported by Chuan Liu and fixed by Chuan Liu (mrv2)
Fix child task environment variable config and add support for Windows
- MAPREDUCE-4356.
Major bug reported by Ravi Gummadi and fixed by Ravi Gummadi (tools/rumen)
Provide access to ParsedTask.obtainTaskAttempts()
Made the method ParsedTask.obtainTaskAttempts() public.
- MAPREDUCE-4149.
Major bug reported by Ravi Gummadi and fixed by Ravi Gummadi (tools/rumen)
Rumen fails to parse certain counter strings
Fixes Rumen to parse counter strings containing the special characters "{" and "}".
- MAPREDUCE-4100.
Minor bug reported by Karam Singh and fixed by Amar Kamat (contrib/gridmix)
Sometimes gridmix emulates data much larger than the actual counter for map-only jobs
Bug fixed in compression emulation feature for map only jobs.
- MAPREDUCE-4087.
Major bug reported by Ravi Gummadi and fixed by Ravi Gummadi
[Gridmix] GenerateDistCacheData job of Gridmix can become slow in some cases
Fixes the issue of GenerateDistCacheData job slowness.
- MAPREDUCE-4083.
Major bug reported by Karam Singh and fixed by Amar Kamat (contrib/gridmix)
GridMix emulated job tasks.resource-usage emulator for CPU usage throws NPE when Trace contains cumulativeCpuUsage value of 0 at attempt level
Fixes NPE in CPU emulation in Gridmix.
- MAPREDUCE-4067.
Critical bug reported by Jitendra Nath Pandey and fixed by Xuan Gong
Replace YarnRemoteException with IOException in MRv2 APIs
- MAPREDUCE-4019.
Minor bug reported by B Anil Kumar and fixed by Ashwin Shankar (client)
-list-attempt-ids is not working
- MAPREDUCE-3953.
Major bug reported by Ravi Gummadi and fixed by Ravi Gummadi
Gridmix throws NPE and does not simulate a job if the trace contains null taskStatus for a task
Fixes NPE and makes Gridmix simulate succeeded-jobs-with-failed-tasks. All tasks of such simulated jobs (including the failed ones of the original job) will succeed.
- MAPREDUCE-3872.
Major bug reported by Patrick Hunt and fixed by Robert Kanter (client , mrv2)
event handling races in ContainerLauncherImpl and TestContainerLauncher
- MAPREDUCE-3829.
Major bug reported by Ravi Gummadi and fixed by Ravi Gummadi (contrib/gridmix)
[Gridmix] Gridmix should give better error message when input-data directory already exists and -generate option is given
Makes Gridmix emit the correct error message when the input data directory already exists and the -generate option is used. Makes Gridmix exit with proper exit codes when it fails during argument processing or startup/setup.
- MAPREDUCE-3787.
Major improvement reported by Amar Kamat and fixed by Amar Kamat (contrib/gridmix)
[Gridmix] Improve STRESS mode
JobMonitor can now deploy multiple threads for faster job-status polling. Use 'gridmix.job-monitor.thread-count' to set the number of threads. Stress mode now relies on the updates from the job monitor instead of polling for job status. Failures in job submission now get reported to the statistics module and ultimately reported to the user via summary.
- MAPREDUCE-3757.
Major bug reported by Ravi Gummadi and fixed by Ravi Gummadi (tools/rumen)
Rumen Folder is not adjusting the shuffleFinished and sortFinished times of reduce task attempts
Fixed the sortFinishTime and shuffleFinishTime adjustments in Rumen Folder.
- MAPREDUCE-3685.
Critical bug reported by anty.rao and fixed by anty (mrv2)
There are some bugs in implementation of MergeManager
- MAPREDUCE-3533.
Minor improvement reported by Steve Loughran and fixed by (mrv2)
have the service interface extend Closeable and use close() as its shutdown operation
- MAPREDUCE-3502.
Major task reported by Steve Loughran and fixed by Steve Loughran (mrv2)
Review all Service.stop() operations and make sure that they work before a service is started
- MAPREDUCE-3008.
Major sub-task reported by Amar Kamat and fixed by Amar Kamat (contrib/gridmix)
[Gridmix] Improve cumulative CPU usage emulation for short running tasks
Improves cumulative CPU emulation for short running tasks.
- MAPREDUCE-2722.
Major bug reported by Ravi Gummadi and fixed by Ravi Gummadi (contrib/gridmix)
Gridmix simulated job's map's hdfsBytesRead counter is wrong when compressed input is used
Makes Gridmix use the uncompressed input data size while simulating map tasks in the case where compressed input data was used in original job.
- HDFS-5027.
Major improvement reported by Aaron T. Myers and fixed by Aaron T. Myers (datanode)
On startup, DN should scan volumes in parallel
- HDFS-5025.
Major sub-task reported by Jing Zhao and fixed by Jing Zhao (ha , namenode)
Record ClientId and CallId in EditLog to enable rebuilding retry cache in case of HA failover
- HDFS-5024.
Major bug reported by Arpit Agarwal and fixed by Arpit Agarwal (namenode)
Make DatanodeProtocol#commitBlockSynchronization idempotent
- HDFS-5020.
Major improvement reported by Jing Zhao and fixed by Jing Zhao (namenode)
Make DatanodeProtocol#blockReceivedAndDeleted idempotent
- HDFS-5018.
Minor bug reported by Ted Yu and fixed by Ted Yu
Misspelled DFSConfigKeys#DFS_NAMENODE_STALE_DATANODE_INTERVAL_DEFAULT in javadoc of DatanodeInfo#isStale()
- HDFS-5016.
Blocker bug reported by Devaraj Das and fixed by Suresh Srinivas
Deadlock in pipeline recovery causes Datanode to be marked dead
- HDFS-5010.
Major improvement reported by Kihwal Lee and fixed by Kihwal Lee (namenode , performance)
Reduce the frequency of getCurrentUser() calls from namenode
- HDFS-5008.
Major improvement reported by Suresh Srinivas and fixed by Jing Zhao (namenode)
Make ClientProtocol#abandonBlock() idempotent
- HDFS-5007.
Minor improvement reported by Kousuke Saruta and fixed by Kousuke Saruta
Replace hard-coded property keys with DFSConfigKeys fields
- HDFS-5005.
Major bug reported by Jing Zhao and fixed by Jing Zhao
Move SnapshotException and SnapshotAccessControlException to o.a.h.hdfs.protocol
- HDFS-5003.
Minor bug reported by Xi Fang and fixed by Xi Fang (test)
TestNNThroughputBenchmark fails because of existing directories
- HDFS-4999.
Major bug reported by Kihwal Lee and fixed by Colin Patrick McCabe
fix TestShortCircuitLocalRead on branch-2
- HDFS-4998.
Major bug reported by Kihwal Lee and fixed by Kihwal Lee (test)
TestUnderReplicatedBlocks fails intermittently
- HDFS-4996.
Minor improvement reported by Chris Nauroth and fixed by Chris Nauroth (namenode)
ClientProtocol#metaSave can be made idempotent by overwriting the output file instead of appending to it
The dfsadmin -metasave command has been changed to overwrite the output file. Previously, this command would append to the output file if it already existed.
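To see why overwriting makes a retried call idempotent while appending does not, consider the following local-filesystem sketch. MetaSaveSketch and its methods are hypothetical and use java.nio rather than any HDFS code; the report string stands in for the metasave output.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch on the local filesystem; not the NameNode implementation.
class MetaSaveSketch {
    // Previous behaviour: every call appends, so a retry duplicates the report.
    static void saveAppending(Path out, String report) throws IOException {
        Files.write(out, report.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // New behaviour: every call truncates first, so a retry leaves the same file.
    static void saveOverwriting(Path out, String report) throws IOException {
        Files.write(out, report.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);
    }

    // Writes the same report `calls` times to a temp file and returns its final size.
    static long sizeAfterCalls(boolean overwrite, String report, int calls) {
        try {
            Path out = Files.createTempFile("metasave-sketch", ".txt");
            try {
                for (int i = 0; i < calls; i++) {
                    if (overwrite) {
                        saveOverwriting(out, report);
                    } else {
                        saveAppending(out, report);
                    }
                }
                return Files.size(out);
            } finally {
                Files.deleteIfExists(out);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String report = "abc\n";
        System.out.println("overwrite x3: " + sizeAfterCalls(true, report, 3) + " bytes");
        System.out.println("append x3: " + sizeAfterCalls(false, report, 3) + " bytes");
    }
}
```

With overwriting, the file size after three calls equals the size after one, which is the property a client retry relies on.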
- HDFS-4992.
Major improvement reported by Max Lapan and fixed by Max Lapan (balancer)
Make balancer's thread count configurable
- HDFS-4982.
Major bug reported by Todd Lipcon and fixed by Todd Lipcon (journal-node , security)
JournalNode should relogin from keytab before fetching logs from other JNs
- HDFS-4980.
Major bug reported by Mark Grover and fixed by Mark Grover (build)
Incorrect logging.properties file for hadoop-httpfs
- HDFS-4979.
Major sub-task reported by Suresh Srinivas and fixed by Suresh Srinivas (namenode)
Implement retry cache on the namenode
- HDFS-4978.
Major improvement reported by Jing Zhao and fixed by Jing Zhao
Make disallowSnapshot idempotent
- HDFS-4974.
Major sub-task reported by Suresh Srinivas and fixed by Suresh Srinivas (ha , namenode)
Analyze and add annotations to Namenode protocol methods and enable retry
- HDFS-4969.
Blocker bug reported by Robert Kanter and fixed by Robert Kanter (test , webhdfs)
WebhdfsFileSystem expects non-standard WEBHDFS Json element
- HDFS-4954.
Major bug reported by Brandon Li and fixed by Brandon Li (nfs)
compile failure in branch-2: getFlushedOffset should catch or rethrow IOException
- HDFS-4951.
Major bug reported by Robert Kanter and fixed by Robert Kanter (security)
FsShell commands using secure httpfs throw exceptions due to missing TokenRenewer
- HDFS-4948.
Major bug reported by Robert Joseph Evans and fixed by Brandon Li
mvn site for hadoop-hdfs-nfs fails
- HDFS-4944.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (webhdfs)
WebHDFS cannot create a file path containing characters that must be URI-encoded, such as space.
- HDFS-4943.
Minor bug reported by Jerry He and fixed by Jerry He (webhdfs)
WebHdfsFileSystem does not work when original file path has encoded chars
- HDFS-4932.
Minor improvement reported by Fengdong Yu and fixed by Fengdong Yu (ha , namenode)
Avoid a wide line on the name node webUI if we have more Journal nodes
- HDFS-4927.
Minor bug reported by Chris Nauroth and fixed by Chris Nauroth (test)
CreateEditsLog creates inodes with an invalid inode ID, which then cannot be loaded by a namenode.
- HDFS-4917.
Major bug reported by Fengdong Yu and fixed by Fengdong Yu (datanode , namenode)
Start-dfs.sh cannot pass the parameters correctly
- HDFS-4914.
Minor improvement reported by Tsz Wo (Nicholas), SZE and fixed by Tsz Wo (Nicholas), SZE (hdfs-client)
When possible, Use DFSClient.Conf instead of Configuration
- HDFS-4912.
Major improvement reported by Suresh Srinivas and fixed by Suresh Srinivas (namenode)
Cleanup FSNamesystem#startFileInternal
- HDFS-4910.
Major bug reported by Chuan Liu and fixed by Chuan Liu
TestPermission failed in branch-2
- HDFS-4908.
Major sub-task reported by Tsz Wo (Nicholas), SZE and fixed by Tsz Wo (Nicholas), SZE (namenode , snapshots)
Reduce snapshot inode memory usage
- HDFS-4906.
Major bug reported by Aaron T. Myers and fixed by Aaron T. Myers (hdfs-client)
HDFS Output streams should not accept writes after being closed
- HDFS-4903.
Minor improvement reported by Suresh Srinivas and fixed by Arpit Agarwal (namenode)
Print trash configuration and trash emptier state in namenode log
- HDFS-4902.
Major bug reported by Binglin Chang and fixed by Binglin Chang (snapshots)
DFSClient.getSnapshotDiffReport should use string path rather than o.a.h.fs.Path
- HDFS-4888.
Major bug reported by Ravi Prakash and fixed by Ravi Prakash
Refactor and fix FSNamesystem.getTurnOffTip to sanity
- HDFS-4887.
Blocker bug reported by Kihwal Lee and fixed by Kihwal Lee (benchmarks , test)
TestNNThroughputBenchmark exits abruptly
- HDFS-4883.
Major bug reported by Konstantin Shvachko and fixed by Tao Luo (namenode)
complete() should verify fileId
- HDFS-4880.
Major bug reported by Arpit Agarwal and fixed by Suresh Srinivas (namenode)
Diagnostic logging while loading name/edits files
- HDFS-4878.
Major bug reported by Tao Luo and fixed by Tao Luo (namenode)
On Remove Block, Block is not Removed from neededReplications queue
- HDFS-4877.
Blocker bug reported by Jing Zhao and fixed by Jing Zhao (snapshots)
Snapshot: fix the scenario where a directory is renamed under its prior descendant
- HDFS-4876.
Minor sub-task reported by Tsz Wo (Nicholas), SZE and fixed by Tsz Wo (Nicholas), SZE (snapshots)
The javadoc of FileWithSnapshot is incorrect
- HDFS-4875.
Minor sub-task reported by Tsz Wo (Nicholas), SZE and fixed by Arpit Agarwal (snapshots , test)
Add a test for testing snapshot file length
- HDFS-4873.
Major bug reported by Hari Mankude and fixed by Jing Zhao (snapshots)
callGetBlockLocations returns incorrect number of blocks for snapshotted files
- HDFS-4867.
Major bug reported by Kihwal Lee and fixed by Plamen Jeliazkov (namenode)
metaSave NPEs when there are invalid blocks in repl queue.
- HDFS-4866.
Blocker bug reported by Ralph Castain and fixed by Arpit Agarwal (namenode)
Protocol buffer support cannot compile under C
The Protocol Buffers definition of the inter-namenode protocol required a change for compatibility with compiled C clients. This is a backwards-incompatible change. A namenode prior to this change will not be able to communicate with a namenode after this change.
- HDFS-4865.
Major bug reported by Wei Yan and fixed by Wei Yan
Remove sub resource warning from httpfs log at startup time
- HDFS-4863.
Major bug reported by Jing Zhao and fixed by Jing Zhao (snapshots)
The root directory should be added to the snapshottable directory list while loading fsimage
- HDFS-4862.
Major bug reported by Ravi Prakash and fixed by Ravi Prakash
SafeModeInfo.isManual() returns true when resources are low even if it wasn't entered into manually
- HDFS-4857.
Major bug reported by Jing Zhao and fixed by Jing Zhao (snapshots)
Snapshot.Root and AbstractINodeDiff#snapshotINode should not be put into INodeMap when loading FSImage
- HDFS-4850.
Major bug reported by Stephen Chu and fixed by Jing Zhao (tools)
fix OfflineImageViewer to work on fsimages with empty files or snapshots
- HDFS-4848.
Minor improvement reported by Stephen Chu and fixed by Jing Zhao (snapshots)
copyFromLocal and renaming a file to ".snapshot" should output that ".snapshot" is a reserved name
- HDFS-4846.
Minor bug reported by Stephen Chu and fixed by Jing Zhao (snapshots)
Clean up snapshot CLI commands output stacktrace for invalid arguments
- HDFS-4845.
Critical bug reported by Kihwal Lee and fixed by Arpit Agarwal (namenode)
FSEditLogLoader gets NPE while accessing INodeMap in TestEditLogRace
- HDFS-4842.
Major sub-task reported by Jing Zhao and fixed by Jing Zhao (snapshots)
Snapshot: identify the correct prior snapshot when deleting a snapshot under a renamed subtree
- HDFS-4841.
Major bug reported by Stephen Chu and fixed by Robert Kanter (security , webhdfs)
FsShell commands using secure webhdfs fail ClientFinalizer shutdown hook
- HDFS-4840.
Major bug reported by Kihwal Lee and fixed by Kihwal Lee (namenode)
ReplicationMonitor gets NPE during shutdown
- HDFS-4832.
Critical bug reported by Ravi Prakash and fixed by Ravi Prakash
Namenode doesn't change the number of missing blocks in safemode when DNs rejoin or leave
This change makes the name node keep its internal replication queues and data node state updated while in manual safe mode, so that metrics and the UI present up-to-date information in safe mode. The behavior during start-up safe mode is unchanged.
- HDFS-4830.
Minor bug reported by Aaron T. Myers and fixed by Aaron T. Myers
Typo in config settings for AvailableSpaceVolumeChoosingPolicy in hdfs-default.xml
- HDFS-4827.
Major bug reported by Devaraj Das and fixed by Devaraj Das
Slight update to the implementation of API for handling favored nodes in DFSClient
- HDFS-4826.
Minor bug reported by Chris Nauroth and fixed by Chris Nauroth (test)
TestNestedSnapshots times out due to repeated slow edit log flushes when running on virtualized disk
- HDFS-4825.
Major bug reported by Andrew Wang and fixed by Andrew Wang (webhdfs)
webhdfs / httpfs tests broken because of min block size change
- HDFS-4824.
Major bug reported by Henry Robinson and fixed by Colin Patrick McCabe (hdfs-client)
FileInputStreamCache.close leaves dangling reference to FileInputStreamCache.cacheCleaner
- HDFS-4819.
Minor sub-task reported by Tsz Wo (Nicholas), SZE and fixed by Tsz Wo (Nicholas), SZE (documentation)
Update Snapshot doc for HDFS-4758
- HDFS-4818.
Minor bug reported by Chris Nauroth and fixed by Chris Nauroth (namenode , test)
several HDFS tests that attempt to make directories unusable do not work correctly on Windows
- HDFS-4815.
Major bug reported by Tian Hong Wang and fixed by Tian Hong Wang (datanode , test)
TestRBWBlockInvalidation#testBlockInvalidationWhenRBWReplicaMissedInDN: Double call countReplicas() to fetch corruptReplicas and liveReplicas is not needed
- HDFS-4813.
Minor bug reported by Tsz Wo (Nicholas), SZE and fixed by Jing Zhao (namenode)
BlocksMap may throw NullPointerException during shutdown
- HDFS-4810.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (test)
several HDFS HA tests have timeouts that are too short
- HDFS-4807.
Major bug reported by Kihwal Lee and fixed by Cristina L. Abad
DFSOutputStream.createSocketForPipeline() should not include timeout extension on connect
- HDFS-4805.
Critical bug reported by Daryn Sharp and fixed by Daryn Sharp (webhdfs)
Webhdfs client is fragile to token renewal errors
- HDFS-4804.
Minor improvement reported by Stephen Chu and fixed by Stephen Chu
WARN when users set the block balanced preference percent below 0.5 or above 1.0
- HDFS-4799.
Blocker bug reported by Todd Lipcon and fixed by Todd Lipcon (namenode)
Corrupt replica can be prematurely removed from corruptReplicas map
- HDFS-4797.
Minor bug reported by Tsz Wo (Nicholas), SZE and fixed by Tsz Wo (Nicholas), SZE (datanode)
BlockScanInfo does not override equals(..) and hashCode() consistently
- HDFS-4787.
Major improvement reported by Tian Hong Wang and fixed by Tian Hong Wang
Create a new HdfsConfiguration before each TestDFSClientRetries testcases
- HDFS-4785.
Major sub-task reported by Suresh Srinivas and fixed by Suresh Srinivas (namenode)
Concat operation does not remove concatenated files from InodeMap
- HDFS-4784.
Major sub-task reported by Brandon Li and fixed by Brandon Li (namenode)
NPE in FSDirectory.resolvePath()
- HDFS-4783.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (test)
TestDelegationTokensWithHA#testHAUtilClonesDelegationTokens fails on Windows
- HDFS-4780.
Minor bug reported by Kihwal Lee and fixed by Robert Parker (namenode)
Use the correct relogin method for services
- HDFS-4778.
Major bug reported by Devaraj Das and fixed by Devaraj Das (namenode)
Invoke getPipeline in the chooseTarget implementation that has favoredNodes
- HDFS-4772.
Minor improvement reported by Brandon Li and fixed by Brandon Li (namenode)
Add number of children in HdfsFileStatus
- HDFS-4768.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (datanode)
File handle leak in datanode when a block pool is removed
- HDFS-4765.
Major bug reported by Andrew Wang and fixed by Andrew Wang (namenode)
Permission check of symlink deletion incorrectly throws UnresolvedLinkException
- HDFS-4762.
Major sub-task reported by Brandon Li and fixed by Brandon Li (nfs)
Provide HDFS based NFSv3 and Mountd implementation
- HDFS-4751.
Minor bug reported by Andrew Wang and fixed by Andrew Wang (test)
TestLeaseRenewer#testThreadName flakes
- HDFS-4748.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (qjm , test)
MiniJournalCluster#restartJournalNode leaks resources, which causes sporadic test failures
- HDFS-4745.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (test)
TestDataTransferKeepalive#testSlowReader has race condition that causes sporadic failure
- HDFS-4743.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (test)
TestNNStorageRetentionManager fails on Windows
- HDFS-4741.
Major bug reported by Arpit Agarwal and fixed by Arpit Agarwal (test)
TestStorageRestore#testStorageRestoreFailure fails on Windows
- HDFS-4740.
Major bug reported by Arpit Agarwal and fixed by Arpit Agarwal (test)
Fixes for a few test failures on Windows
- HDFS-4739.
Major bug reported by Aaron T. Myers and fixed by Aaron T. Myers (namenode)
NN can miscalculate the number of extra edit log segments to retain
- HDFS-4737.
Major bug reported by Sean Mackrory and fixed by Sean Mackrory
JVM path embedded in fuse binaries
- HDFS-4734.
Major bug reported by Arpit Agarwal and fixed by Arpit Agarwal
HDFS Tests that use ShellCommandFencer are broken on Windows
- HDFS-4733.
Major bug reported by Alejandro Abdelnur and fixed by Alejandro Abdelnur
Make HttpFS username pattern configurable
- HDFS-4732.
Minor bug reported by Chris Nauroth and fixed by Chris Nauroth (test)
TestDFSUpgradeFromImage fails on Windows due to failure to unpack old image tarball that contains hard links
- HDFS-4725.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (namenode , test , tools)
fix HDFS file handle leaks
- HDFS-4722.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic (test)
TestGetConf#testFederation times out on Windows
- HDFS-4721.
Major improvement reported by Varun Sharma and fixed by Varun Sharma (namenode)
Speed up lease/block recovery when DN fails and a block goes into recovery
- HDFS-4714.
Major bug reported by Kihwal Lee and fixed by Kihwal Lee (namenode)
Log short messages in Namenode RPC server for exceptions meant for clients
- HDFS-4705.
Minor bug reported by Ivan Mitic and fixed by Ivan Mitic
Address HDFS test failures on Windows because of invalid dfs.namenode.name.dir
- HDFS-4699.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (test)
TestPipelinesFailover#testPipelineRecoveryStress fails sporadically
- HDFS-4698.
Minor improvement reported by Colin Patrick McCabe and fixed by Colin Patrick McCabe (hdfs-client)
provide client-side metrics for remote reads, local reads, and short-circuit reads
- HDFS-4695.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic (test)
TestEditLog leaks open file handles between tests
- HDFS-4693.
Minor bug reported by Arpit Agarwal and fixed by Arpit Agarwal (test)
Some test cases in TestCheckpoint do not clean up after themselves
- HDFS-4687.
Minor bug reported by Andrew Wang and fixed by Andrew Wang (test)
TestDelegationTokenForProxyUser#testWebHdfsDoAs is flaky with JDK7
- HDFS-4679.
Major improvement reported by Suresh Srinivas and fixed by Suresh Srinivas (namenode)
Namenode operation checks should be done in a consistent manner
- HDFS-4677.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic
Editlog should support synchronous writes
- HDFS-4676.
Minor bug reported by Suresh Srinivas and fixed by Suresh Srinivas (test)
TestHDFSFileSystemContract should set MiniDFSCluster variable to null to free up memory
- HDFS-4674.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (test)
TestBPOfferService fails on Windows due to failure parsing datanode data directory as URI
- HDFS-4669.
Major bug reported by Tian Hong Wang and fixed by Tian Hong Wang (test)
TestBlockPoolManager fails using IBM java
- HDFS-4659.
Major bug reported by Brandon Li and fixed by Brandon Li (namenode)
Support setting execution bit for regular files
- HDFS-4658.
Trivial bug reported by Aaron T. Myers and fixed by Aaron T. Myers (ha , namenode)
Standby NN will log that it has received a block report "after becoming active"
- HDFS-4655.
Minor bug reported by Aaron T. Myers and fixed by Aaron T. Myers (datanode)
DNA_FINALIZE is logged as being an unknown command by the DN when received from the standby NN
- HDFS-4646.
Minor bug reported by Jagane Sundar and fixed by (namenode)
createNNProxyWithClientProtocol ignores configured timeout value
- HDFS-4645.
Major improvement reported by Suresh Srinivas and fixed by Arpit Agarwal (namenode)
Move from randomly generated block ID to sequentially generated block ID
- HDFS-4643.
Trivial bug reported by Todd Lipcon and fixed by Todd Lipcon (qjm , test)
Fix flakiness in TestQuorumJournalManager
- HDFS-4639.
Major bug reported by Konstantin Shvachko and fixed by Plamen Jeliazkov (namenode)
startFileInternal() should not increment generation stamp
- HDFS-4635.
Major improvement reported by Suresh Srinivas and fixed by Suresh Srinivas (namenode)
Move BlockManager#computeCapacity to LightWeightGSet
- HDFS-4625.
Minor bug reported by Arpit Agarwal and fixed by Ivan Mitic (test)
Make TestNNWithQJM#testNewNamenodeTakesOverWriter work on Windows
- HDFS-4621.
Minor bug reported by Todd Lipcon and fixed by Todd Lipcon (ha , qjm)
additional logging to help diagnose slow QJM logSync
- HDFS-4620.
Major bug reported by Sandy Ryza and fixed by Sandy Ryza (documentation)
Documentation for dfs.namenode.rpc-address specifies wrong format
- HDFS-4618.
Major bug reported by Todd Lipcon and fixed by Todd Lipcon (namenode)
default for checkpoint txn interval is too low
- HDFS-4615.
Major bug reported by Arpit Agarwal and fixed by Arpit Agarwal (test)
Fix TestDFSShell failures on Windows
- HDFS-4614.
Trivial bug reported by Aaron T. Myers and fixed by Aaron T. Myers (namenode)
FSNamesystem#getContentSummary should use getPermissionChecker helper method
- HDFS-4610.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic
Move to using common utils FileUtil#setReadable/Writable/Executable and FileUtil#canRead/Write/Execute
- HDFS-4609.
Minor bug reported by Ivan Mitic and fixed by Ivan Mitic (test)
TestAuditLogs should release log handles between tests
- HDFS-4607.
Minor bug reported by Ivan Mitic and fixed by Ivan Mitic (test)
TestGetConf#testGetSpecificKey fails on Windows
- HDFS-4604.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic
TestJournalNode fails on Windows
- HDFS-4603.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic
TestMiniDFSCluster fails on Windows
- HDFS-4602.
Major sub-task reported by Suresh Srinivas and fixed by Uma Maheswara Rao G
TestBookKeeperHACheckpoints fails
- HDFS-4598.
Minor bug reported by Tsz Wo (Nicholas), SZE and fixed by Tsz Wo (Nicholas), SZE (webhdfs)
WebHDFS concat: the default value of sources in the code does not match the doc
- HDFS-4596.
Major bug reported by Andrew Wang and fixed by Andrew Wang (namenode)
Shutting down namenode during checkpointing can lead to md5sum error
- HDFS-4595.
Major bug reported by Suresh Srinivas and fixed by Suresh Srinivas (hdfs-client)
When short circuit read fails, DFSClient does not fall back to regular reads
- HDFS-4593.
Major bug reported by Arpit Agarwal and fixed by Arpit Agarwal
TestSaveNamespace fails on Windows
- HDFS-4592.
Minor bug reported by Aaron T. Myers and fixed by Aaron T. Myers (namenode)
Default values for access time precision are out of sync between hdfs-default.xml and the code
- HDFS-4591.
Major bug reported by Aaron T. Myers and fixed by Aaron T. Myers (ha , namenode)
HA clients can fail to fail over while Standby NN is performing long checkpoint
- HDFS-4586.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic
TestDataDirs.testGetDataDirsFromURIs fails with all directories in dfs.datanode.data.dir are invalid
- HDFS-4583.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic
TestNodeCount fails
- HDFS-4582.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic
TestHostsFiles fails on Windows
- HDFS-4573.
Major bug reported by Arpit Agarwal and fixed by Arpit Agarwal (test)
Fix TestINodeFile on Windows
- HDFS-4572.
Major bug reported by Arpit Agarwal and fixed by Arpit Agarwal (namenode , test)
Fix TestJournal failures on Windows
- HDFS-4569.
Trivial improvement reported by Andrew Wang and fixed by Andrew Wang
Small image transfer related cleanups.
- HDFS-4565.
Minor improvement reported by Arpit Gupta and fixed by Arpit Gupta (security)
use DFSUtil.getSpnegoKeytabKey() to get the spnego keytab key in secondary namenode and namenode http server
- HDFS-4544.
Major bug reported by Amareshwari Sriramadasu and fixed by Arpit Agarwal
Error in deleting blocks should not do check disk, for all types of errors
- HDFS-4542.
Blocker sub-task reported by Daryn Sharp and fixed by Daryn Sharp (webhdfs)
Webhdfs doesn't support secure proxy users
- HDFS-4541.
Major bug reported by Arpit Gupta and fixed by Arpit Gupta (datanode , security)
set hadoop.log.dir and hadoop.id.str when starting secure datanode so it writes the logs to the correct dir by default
- HDFS-4540.
Major bug reported by Arpit Gupta and fixed by Arpit Gupta (security)
namenode http server should use the web authentication keytab for spnego principal
- HDFS-4533.
Major bug reported by Fengdong Yu and fixed by Fengdong Yu (datanode , namenode)
start-dfs.sh ignored additional parameters besides -upgrade
- HDFS-4532.
Critical bug reported by Daryn Sharp and fixed by Daryn Sharp (namenode)
RPC call queue may fill due to current user lookup
- HDFS-4525.
Major sub-task reported by Uma Maheswara Rao G and fixed by SreeHari (namenode)
Provide an API for knowing whether a file is closed or not.
- HDFS-4522.
Minor bug reported by Colin Patrick McCabe and fixed by Colin Patrick McCabe
LightWeightGSet expects incrementing a volatile to be atomic
- HDFS-4521.
Minor improvement reported by Colin Patrick McCabe and fixed by Colin Patrick McCabe
invalid network topologies should not be cached
- HDFS-4519.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (datanode , scripts)
Support override of jsvc binary and log file locations when launching secure datanode.
With this improvement the following options are available in release 1.2.0 and later on the 1.x release stream:
1. jsvc location can be overridden by setting the environment variable JSVC_HOME. Defaults to the jsvc binary packaged within the Hadoop distro.
2. jsvc log output is directed to the file defined by JSVC_OUTFILE. Defaults to $HADOOP_LOG_DIR/jsvc.out.
3. jsvc error output is directed to the file defined by JSVC_ERRFILE. Defaults to $HADOOP_LOG_DIR/jsvc.err.
With this improvement the following options are available in release 2.0.4 and later on the 2.x release stream:
1. jsvc log output is directed to the file defined by JSVC_OUTFILE. Defaults to $HADOOP_LOG_DIR/jsvc.out.
2. jsvc error output is directed to the file defined by JSVC_ERRFILE. Defaults to $HADOOP_LOG_DIR/jsvc.err.
For overriding the jsvc location on 2.x releases, here is the release note from HDFS-2303:
To run secure Datanodes users must install jsvc for their platform and set JSVC_HOME to point to the location of jsvc in their environment.
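The 2.x behavior described above can be sketched as a small lookup over environment variables. This is only an illustration of the documented defaults, not the logic of the actual launch scripts; the function name resolve_jsvc_settings and the returned dictionary keys are hypothetical.

```python
import posixpath


def resolve_jsvc_settings(env):
    """Resolve jsvc settings from a dict of environment variables (2.x rules).

    Hypothetical helper: mirrors the defaults stated in the release note,
    not the real start-up scripts.
    """
    jsvc_home = env.get("JSVC_HOME")
    if not jsvc_home:
        # On 2.x the note says users must set JSVC_HOME themselves.
        raise ValueError("JSVC_HOME must point to the location of jsvc")

    def default_log(name):
        # Defaults live under $HADOOP_LOG_DIR, as stated in the note.
        return posixpath.join(env["HADOOP_LOG_DIR"], name)

    return {
        "jsvc_home": jsvc_home,
        "outfile": env.get("JSVC_OUTFILE") or default_log("jsvc.out"),
        "errfile": env.get("JSVC_ERRFILE") or default_log("jsvc.err"),
    }
```

With only JSVC_HOME and HADOOP_LOG_DIR set, both log files fall back to the log directory; setting JSVC_OUTFILE or JSVC_ERRFILE overrides the corresponding entry.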
- HDFS-4518.
Major bug reported by Arpit Agarwal and fixed by Arpit Agarwal
Finer grained metrics for HDFS capacity
- HDFS-4502.
Blocker sub-task reported by Alejandro Abdelnur and fixed by Brandon Li (webhdfs)
WebHdfsFileSystem handling of fileId breaks compatibility
- HDFS-4495.
Major bug reported by Kihwal Lee and fixed by Kihwal Lee (hdfs-client)
Allow client-side lease renewal to be retried beyond soft-limit
- HDFS-4484.
Minor bug reported by Colin Patrick McCabe and fixed by Colin Patrick McCabe
libwebhdfs compilation broken with gcc 4.6.2
- HDFS-4477.
Critical bug reported by Kihwal Lee and fixed by Daryn Sharp (security)
Secondary namenode may retain old tokens
- HDFS-4471.
Major bug reported by Andrew Wang and fixed by Andrew Wang (namenode)
Namenode WebUI file browsing does not work with wildcard addresses configured
- HDFS-4470.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth
several HDFS tests attempt file operations on invalid HDFS paths when running on Windows
- HDFS-4465.
Major improvement reported by Suresh Srinivas and fixed by Aaron T. Myers (datanode)
Optimize datanode ReplicasMap and ReplicaInfo
- HDFS-4461.
Minor improvement reported by Colin Patrick McCabe and fixed by Colin Patrick McCabe
DirectoryScanner: volume path prefix takes up memory for every block that is scanned
- HDFS-4434.
Major sub-task reported by Brandon Li and fixed by Suresh Srinivas (namenode)
Provide a mapping from INodeId to INode
This change adds support for referencing files and directories by fileID/inodeID using a path of the form /.reserved/.inodes/<inodeid>. With this change, creating a file or directory named /.reserved is no longer allowed. Before upgrading to a release with this change, any existing file or directory /.reserved needs to be renamed to another name.
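As a rough illustration of the path scheme in this note, the sketch below builds an inode-based path and flags paths that would collide with the reserved name. The helper names inode_path and collides_with_reserved are hypothetical, and treating everything under /.reserved as a collision is an assumption made for the example.

```python
RESERVED_ROOT = ".reserved"   # top-level name reserved by this change
INODES_DIR = ".inodes"        # directory component used for inode lookups


def inode_path(inode_id):
    """Build the /.reserved/.inodes/<inodeid> path for a given inode ID."""
    if inode_id < 0:
        raise ValueError("inode ID must be non-negative")
    return "/{}/{}/{}".format(RESERVED_ROOT, INODES_DIR, inode_id)


def collides_with_reserved(path):
    """Return True if the path is /.reserved or lies beneath it.

    Assumption: only the first path component matters; a directory such as
    /user/.reserved is not affected.
    """
    parts = [p for p in path.split("/") if p]
    return bool(parts) and parts[0] == RESERVED_ROOT
```

A pre-upgrade check could call collides_with_reserved on each top-level entry to find anything that has to be renamed first.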
- HDFS-4382.
Major bug reported by Ted Yu and fixed by Ted Yu
Fix typo MAX_NOT_CHANGED_INTERATIONS
- HDFS-4374.
Major sub-task reported by Chris Nauroth and fixed by Chris Nauroth (namenode)
Display NameNode startup progress in UI
- HDFS-4373.
Major sub-task reported by Chris Nauroth and fixed by Chris Nauroth (namenode)
Add HTTP API for querying NameNode startup progress
- HDFS-4372.
Major sub-task reported by Chris Nauroth and fixed by Chris Nauroth (namenode)
Track NameNode startup progress
- HDFS-4346.
Minor sub-task reported by Tsz Wo (Nicholas), SZE and fixed by Tsz Wo (Nicholas), SZE (namenode)
Refactor INodeId and GenerationStamp
- HDFS-4342.
Major bug reported by Mark Yang and fixed by Arpit Agarwal (namenode)
Edits dir in dfs.namenode.edits.dir.required will be silently ignored if it is not in dfs.namenode.edits.dir
- HDFS-4340.
Major sub-task reported by Brandon Li and fixed by Brandon Li (hdfs-client , namenode)
Update addBlock() to include inode id as additional argument
- HDFS-4339.
Major sub-task reported by Brandon Li and fixed by Brandon Li (namenode)
Persist inode id in fsimage and editlog
- HDFS-4334.
Major sub-task reported by Brandon Li and fixed by Brandon Li (namenode)
Add a unique id to each INode
- HDFS-4305.
Minor bug reported by Todd Lipcon and fixed by Andrew Wang (namenode)
Add a configurable limit on number of blocks per file, and min block size
This change introduces a maximum number of blocks per file, by default one million, and a minimum block size, by default 1MB. These can optionally be changed via the configuration settings "dfs.namenode.fs-limits.max-blocks-per-file" and "dfs.namenode.fs-limits.min-block-size", respectively.
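The two limits can be pictured as a simple validation step. The sketch below is not the namenode's code; it assumes 1MB means 1024 * 1024 bytes and that a file holding exactly the maximum number of blocks is still accepted. The function name check_fs_limits is hypothetical.

```python
DEFAULT_MAX_BLOCKS_PER_FILE = 1000000      # "one million" per the note
DEFAULT_MIN_BLOCK_SIZE = 1024 * 1024       # assumption: 1MB = 2**20 bytes


def check_fs_limits(block_size, block_count,
                    max_blocks=DEFAULT_MAX_BLOCKS_PER_FILE,
                    min_block_size=DEFAULT_MIN_BLOCK_SIZE):
    """Return a list of violated limits (empty when the file is acceptable)."""
    problems = []
    if block_size < min_block_size:
        problems.append("block size %d is below dfs.namenode.fs-limits.min-block-size (%d)"
                        % (block_size, min_block_size))
    if block_count > max_blocks:
        problems.append("block count %d exceeds dfs.namenode.fs-limits.max-blocks-per-file (%d)"
                        % (block_count, max_blocks))
    return problems
```

Passing different max_blocks or min_block_size values corresponds to overriding the two configuration settings named in the note.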
- HDFS-4304.
Major improvement reported by Todd Lipcon and fixed by Colin Patrick McCabe (namenode)
Make FSEditLogOp.MAX_OP_SIZE configurable
- HDFS-4300.
Critical bug reported by Todd Lipcon and fixed by Andrew Wang
TransferFsImage.downloadEditsToStorage should use a tmp file for destination
- HDFS-4298.
Major bug reported by Todd Lipcon and fixed by Aaron T. Myers (namenode)
StorageRetentionManager spews warnings when used with QJM
- HDFS-4296.
Major bug reported by Suresh Srinivas and fixed by Suresh Srinivas (namenode)
Add layout version for HDFS-4256 for release 1.2.0
- HDFS-4287.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (webhdfs)
HTTPFS tests fail on Windows
- HDFS-4261.
Major bug reported by Tsz Wo (Nicholas), SZE and fixed by Junping Du (balancer)
TestBalancerWithNodeGroup times out
- HDFS-4249.
Major new feature reported by Suresh Srinivas and fixed by Chris Nauroth (namenode)
Add status NameNode startup to webUI
- HDFS-4246.
Minor improvement reported by Harsh J and fixed by Harsh J (hdfs-client)
The exclude node list should be more forgiving, for each output stream
- HDFS-4240.
Major bug reported by Junping Du and fixed by Junping Du (namenode)
In nodegroup-aware case, make sure nodes are avoided to place replica if some replica are already under the same nodegroup
- HDFS-4235.
Minor bug reported by Colin Patrick McCabe and fixed by Colin Patrick McCabe
when outputting XML, OfflineEditsViewer can't handle some edits containing non-ASCII strings
- HDFS-4234.
Minor improvement reported by Tsz Wo (Nicholas), SZE and fixed by Tsz Wo (Nicholas), SZE (balancer)
Use the generic code for choosing datanode in Balancer
- HDFS-4222.
Minor bug reported by Xiaobo Peng and fixed by Xiaobo Peng (namenode)
NN is unresponsive and loses heartbeats of DNs when Hadoop is configured to use LDAP and LDAP has issues
- HDFS-4215.
Major improvement reported by Tsz Wo (Nicholas), SZE and fixed by Tsz Wo (Nicholas), SZE (namenode)
Improvements on INode and image loading
- HDFS-4209.
Major bug reported by Tsz Wo (Nicholas), SZE and fixed by Tsz Wo (Nicholas), SZE (namenode)
Clean up the addNode/addChild/addChildNoQuotaCheck methods in FSDirectory
- HDFS-4206.
Major improvement reported by Tsz Wo (Nicholas), SZE and fixed by Tsz Wo (Nicholas), SZE (namenode)
Change the fields in INode and its subclasses to private
- HDFS-4205.
Major bug reported by Andy Isaacson and fixed by Jason Lowe (hdfs-client)
fsck fails with symlinks
- HDFS-4152.
Minor improvement reported by Tsz Wo (Nicholas), SZE and fixed by Jing Zhao (namenode)
Add a new class for the parameter in INode.collectSubtreeBlocksAndClear(..)
- HDFS-4151.
Minor improvement reported by Tsz Wo (Nicholas), SZE and fixed by Tsz Wo (Nicholas), SZE (namenode)
Passing INodesInPath instead of INode[] in FSDirectory
- HDFS-4129.
Minor test reported by Tsz Wo (Nicholas), SZE and fixed by Tsz Wo (Nicholas), SZE (namenode)
Add utility methods to dump NameNode in memory tree for testing
- HDFS-4128.
Major bug reported by Todd Lipcon and fixed by Kihwal Lee (namenode)
2NN gets stuck in inconsistent state if edit log replay fails in the middle
- HDFS-4124.
Minor new feature reported by Jing Zhao and fixed by Jing Zhao
Refactor INodeDirectory#getExistingPathINodes() to enable returning more than INode array
- HDFS-4053.
Major improvement reported by Eli Collins and fixed by Eli Collins
Increase the default block size
The default block size prior to this change was 64MB. This jira changes the default block size to 128MB. To go back to the previous behavior, set the configuration parameter "dfs.blocksize" to 67108864 in hdfs-site.xml.
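The byte value quoted for "dfs.blocksize" follows from 64 * 1024 * 1024. The short sketch below checks that arithmetic and shows how the default affects the number of blocks a file occupies; the helper blocks_needed is hypothetical and ignores everything except file length.

```python
MIB = 1024 * 1024
OLD_DEFAULT_BLOCK_SIZE = 64 * MIB    # 67108864 bytes, the pre-change default
NEW_DEFAULT_BLOCK_SIZE = 128 * MIB   # 134217728 bytes, the new default


def blocks_needed(file_size, block_size):
    """Number of blocks for a file of file_size bytes (ceiling division)."""
    if block_size <= 0:
        raise ValueError("block size must be positive")
    return -(-file_size // block_size)
```

For a 1 GiB file, the block count halves when moving from the old default to the new one.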
- HDFS-4013.
Trivial bug reported by Chao Shi and fixed by Chao Shi (hdfs-client)
TestHftpURLTimeouts throws NPE
- HDFS-3940.
Minor improvement reported by Eli Collins and fixed by Suresh Srinivas
Add Gset#clear method and clear the block map when namenode is shutdown
- HDFS-3934.
Minor bug reported by Andy Isaacson and fixed by Colin Patrick McCabe
duplicative dfs_hosts entries handled wrong
- HDFS-3880.
Minor improvement reported by Brandon Li and fixed by Brandon Li (datanode , ha , namenode , security)
Use Builder to get RPC server in HDFS
- HDFS-3875.
Critical bug reported by Todd Lipcon and fixed by Kihwal Lee (datanode , hdfs-client)
Issue handling checksum errors in write pipeline
- HDFS-3817.
Major improvement reported by Brandon Li and fixed by Brandon Li (namenode)
avoid printing stack information for SafeModeException
- HDFS-3792.
Trivial bug reported by Todd Lipcon and fixed by Todd Lipcon (build , namenode)
Fix two findbugs introduced by HDFS-3695
- HDFS-3769.
Critical sub-task reported by liaowenrui and fixed by (ha)
standby namenode become active fails because starting log segment fail on shared storage
- HDFS-3601.
Major new feature reported by Junping Du and fixed by Junping Du (namenode)
Implementation of ReplicaPlacementPolicyNodeGroup to support 4-layer network topology
- HDFS-3499.
Major bug reported by Junping Du and fixed by Junping Du (datanode)
Make NetworkTopology support user specified topology class
- HDFS-3498.
Major improvement reported by Junping Du and fixed by Junping Du (namenode)
Make Replica Removal Policy pluggable and ReplicaPlacementPolicyDefault extensible for reusing code in subclass
- HDFS-3495.
Major new feature reported by Junping Du and fixed by Junping Du (balancer)
Update Balancer to support new NetworkTopology with NodeGroup
- HDFS-3277.
Major bug reported by Colin Patrick McCabe and fixed by Andrew Wang
fail over to loading a different FSImage if the first one we try to load is corrupt
- HDFS-3180.
Major bug reported by Daryn Sharp and fixed by Chris Nauroth (webhdfs)
Add socket timeouts to webhdfs
- HDFS-3163.
Trivial improvement reported by Brandon Li and fixed by Brandon Li (test)
TestHDFSCLI.testAll fails if the user name is not all lowercase
- HDFS-3009.
Trivial bug reported by Hari Mankude and fixed by Hari Mankude (hdfs-client)
DFSClient islocaladdress() can use similar routine in netutils
- HDFS-2857.
Major improvement reported by Suresh Srinivas and fixed by Suresh Srinivas (namenode)
Cleanup BlockInfo class
- HDFS-2576.
Major new feature reported by Pritam Damania and fixed by Devaraj Das (hdfs-client , namenode)
Namenode should have a favored nodes hint to enable clients to have control over block placement.
- HDFS-2572.
Trivial improvement reported by Harsh J and fixed by Harsh J (datanode)
Unnecessary double-check in DN#getHostName
- HDFS-2042.
Minor improvement reported by Eli Collins and fixed by (libhdfs)
Require c99 when building libhdfs
- HDFS-1804.
Minor new feature reported by Harsh J and fixed by Aaron T. Myers (datanode)
Add a new block-volume device choosing policy that looks at free space
There is now a new option to have the DN take into account available disk space on each volume when choosing where to place a replica when performing an HDFS write. This can be enabled by setting the config "dfs.datanode.fsdataset.volume.choosing.policy" to the value "org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy".
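To enable the policy, the setting above goes into the datanode configuration as an ordinary name/value property. The sketch below renders such a property element with the Python standard library; the helper name render_property is hypothetical, and the output still has to be placed inside the configuration element of hdfs-site.xml by hand.

```python
import xml.etree.ElementTree as ET

POLICY_KEY = "dfs.datanode.fsdataset.volume.choosing.policy"
POLICY_CLASS = ("org.apache.hadoop.hdfs.server.datanode.fsdataset."
                "AvailableSpaceVolumeChoosingPolicy")


def render_property(name, value):
    """Render one Hadoop-style <property> element as a string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


snippet = render_property(POLICY_KEY, POLICY_CLASS)
```

Printing snippet gives a single property element that can be pasted into the site file.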
- HDFS-347.
Major improvement reported by George Porter and fixed by Colin Patrick McCabe (datanode , hdfs-client , performance)
DFS read performance suboptimal when client co-located on nodes with data
- HADOOP-9792.
Major improvement reported by Suresh Srinivas and fixed by Suresh Srinivas (ipc)
Retry the methods that are tagged @AtMostOnce along with @Idempotent
- HADOOP-9786.
Major bug reported by Jing Zhao and fixed by Jing Zhao
RetryInvocationHandler#isRpcInvocation should support ProtocolTranslator
- HADOOP-9773.
Minor bug reported by Tsz Wo (Nicholas), SZE and fixed by Tsz Wo (Nicholas), SZE (test)
TestLightWeightCache fails
- HADOOP-9770.
Minor improvement reported by Suresh Srinivas and fixed by Suresh Srinivas (util)
Make RetryCache#state non volatile
- HADOOP-9763.
Major new feature reported by Tsz Wo (Nicholas), SZE and fixed by Tsz Wo (Nicholas), SZE (util)
Extends LightWeightGSet to support eviction of expired elements
- HADOOP-9762.
Major improvement reported by Suresh Srinivas and fixed by Suresh Srinivas (util)
RetryCache utility for implementing RPC retries
- HADOOP-9760.
Major improvement reported by Suresh Srinivas and fixed by Suresh Srinivas (util)
Move GSet and LightWeightGSet to hadoop-common
- HADOOP-9759.
Critical bug reported by Chuan Liu and fixed by Chuan Liu
Add support for NativeCodeLoader#getLibraryName on Windows
- HADOOP-9756.
Minor improvement reported by Junping Du and fixed by Junping Du (ipc)
Additional cleanup of RPC code
- HADOOP-9754.
Minor improvement reported by Tsz Wo (Nicholas), SZE and fixed by Tsz Wo (Nicholas), SZE (ipc)
Clean up RPC code
- HADOOP-9751.
Major improvement reported by Tsz Wo (Nicholas), SZE and fixed by Tsz Wo (Nicholas), SZE (ipc)
Add clientId and retryCount to RpcResponseHeaderProto
- HADOOP-9738.
Major bug reported by Kihwal Lee and fixed by Jing Zhao (tools)
TestDistCh fails
- HADOOP-9734.
Minor improvement reported by Jason Lowe and fixed by Jason Lowe (ipc)
Common protobuf definitions for GetUserMappingsProtocol, RefreshAuthorizationPolicyProtocol and RefreshUserMappingsProtocol
- HADOOP-9720.
Major sub-task reported by Suresh Srinivas and fixed by Arpit Agarwal
Rename Client#uuid to Client#clientId
- HADOOP-9717.
Major improvement reported by Suresh Srinivas and fixed by Jing Zhao (ipc)
Add retry attempt count to the RPC requests
- HADOOP-9716.
Major improvement reported by Suresh Srinivas and fixed by Tsz Wo (Nicholas), SZE (ipc)
Move the Rpc request call ID generation to client side InvocationHandler
- HADOOP-9707.
Minor bug reported by Todd Lipcon and fixed by Todd Lipcon (util)
Fix register lists for crc32c inline assembly
- HADOOP-9701.
Minor bug reported by Steve Loughran and fixed by Karthik Kambatla (documentation)
mvn site ambiguous links in hadoop-common
- HADOOP-9698.
Blocker sub-task reported by Daryn Sharp and fixed by Daryn Sharp (ipc)
RPCv9 client must honor server's SASL negotiate response
The RPC client now waits for the Server's SASL negotiate response before instantiating its SASL client.
- HADOOP-9691.
Minor improvement reported by Chris Nauroth and fixed by Chris Nauroth (ipc)
RPC clients can generate call ID using AtomicInteger instead of synchronizing on the Client instance.
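A minimal sketch of the technique named in this entry, using only the JDK: a shared `AtomicInteger` hands out call IDs without taking a lock, next to a synchronized counter for comparison. The class and method names are hypothetical, and masking the value to keep IDs non-negative after wrap-around is an assumption of this sketch rather than something stated in the note.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CallIdSketch {
    // Lock-free counter shared by all callers.
    private static final AtomicInteger COUNTER = new AtomicInteger();

    /** Returns the next call ID; the mask keeps it non-negative after overflow. */
    static int nextCallId() {
        return COUNTER.getAndIncrement() & Integer.MAX_VALUE;
    }

    // Lock-based variant, shown only for comparison.
    private static int lockedCounter = 0;

    static synchronized int nextCallIdSynchronized() {
        return lockedCounter++;
    }

    public static void main(String[] args) {
        System.out.println(nextCallId());
        System.out.println(nextCallId());
    }
}
```

Both variants return distinct IDs under concurrency; the atomic version avoids contending on a monitor for every RPC.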
- HADOOP-9688.
Blocker improvement reported by Suresh Srinivas and fixed by Suresh Srinivas (ipc)
Add globally unique Client ID to RPC requests
- HADOOP-9683.
Blocker sub-task reported by Luke Lu and fixed by Daryn Sharp (ipc)
Wrap IpcConnectionContext in RPC headers
The connection context is now sent as a protobuf wrapped in an RPC header.
- HADOOP-9681.
Minor bug reported by Chuan Liu and fixed by Chuan Liu
FileUtil.unTarUsingJava() should close the InputStream upon finishing
- HADOOP-9678.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic
TestRPC#testStopsAllThreads intermittently fails on Windows
- HADOOP-9676.
Minor improvement reported by Colin Patrick McCabe and fixed by Colin Patrick McCabe
make maximum RPC buffer size configurable
- HADOOP-9673.
Trivial improvement reported by Colin Patrick McCabe and fixed by Colin Patrick McCabe (net)
NetworkTopology: when a node can't be added, print out its location for diagnostic purposes
- HADOOP-9665.
Critical bug reported by Zhijie Shen and fixed by Zhijie Shen
BlockDecompressorStream#decompress throws EOFException instead of returning -1 at EOF
- HADOOP-9661.
Major improvement reported by Sandy Ryza and fixed by Sandy Ryza (metrics)
Allow metrics sources to be extended
- HADOOP-9656.
Minor bug reported by Chuan Liu and fixed by Chuan Liu (test , tools)
Gridmix unit tests fail on Windows and Linux
- HADOOP-9649.
Blocker improvement reported by Zhijie Shen and fixed by Zhijie Shen
Promote YARN service life-cycle libraries into Hadoop Common
- HADOOP-9643.
Minor bug reported by Mark Miller and fixed by Mark Miller (security)
org.apache.hadoop.security.SecurityUtil calls toUpperCase(Locale.getDefault()) as well as toLowerCase(Locale.getDefault()) on hadoop.security.authentication value.
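Why the default locale matters here can be shown with plain JDK calls: under a Turkish locale, lower-case "i" upper-cases to the dotted capital I (U+0130), so a value such as "simple" no longer matches "SIMPLE". The sketch below is illustrative only; the `normalize` helper is hypothetical, and using `Locale.US` as the fixed locale is an assumption borrowed from the HADOOP-8917 title, not a statement of how this issue was fixed.

```java
import java.util.Locale;

public class LocaleCaseSketch {
    /** Hypothetical helper: upper-case a config value independently of the JVM default locale. */
    static String normalize(String value) {
        return value.toUpperCase(Locale.US);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");

        // Under Turkish rules the second character becomes U+0130, not 'I'.
        String turkishUpper = "simple".toUpperCase(turkish);
        System.out.println(turkishUpper.equals("SIMPLE"));        // false
        System.out.println(normalize("simple").equals("SIMPLE")); // true
    }
}
```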
- HADOOP-9638.
Major bug reported by Chris Nauroth and fixed by Andrey Klochkov (test)
parallel test changes caused invalid test path for several HDFS tests on Windows
- HADOOP-9637.
Major bug reported by Chuan Liu and fixed by Chuan Liu
Adding Native Fstat for Windows as needed by YARN
- HADOOP-9632.
Minor bug reported by Chuan Liu and fixed by Chuan Liu
TestShellCommandFencer will fail if there is a 'host' machine in the network
- HADOOP-9630.
Major sub-task reported by Luke Lu and fixed by Junping Du (ipc)
Remove IpcSerializationType
- HADOOP-9625.
Minor improvement reported by Paul Han and fixed by (bin , conf)
HADOOP_OPTS not picked up by hadoop command
- HADOOP-9624.
Minor test reported by Xi Fang and fixed by Xi Fang (test)
TestFSMainOperationsLocalFileSystem failed when the Hadoop test root path has "X" in its name
- HADOOP-9619.
Major sub-task reported by Sanjay Radia and fixed by Sanjay Radia (documentation)
Mark stability of .proto files
- HADOOP-9607.
Minor bug reported by Timothy St. Clair and fixed by (documentation)
Fixes in Javadoc build
- HADOOP-9605.
Major improvement reported by Timothy St. Clair and fixed by (build)
Update junit dependency
- HADOOP-9604.
Minor improvement reported by Jingguo Yao and fixed by Jingguo Yao (fs)
Wrong Javadoc of FSDataOutputStream
- HADOOP-9599.
Major bug reported by Mostafa Elhemali and fixed by Mostafa Elhemali
hadoop-config.cmd doesn't set JAVA_LIBRARY_PATH correctly
- HADOOP-9593.
Major bug reported by Steve Loughran and fixed by Steve Loughran (util)
stack trace printed at ERROR for all yarn clients without hadoop.home set
- HADOOP-9581.
Major bug reported by Ashwin Shankar and fixed by Ashwin Shankar (scripts)
hadoop --config non-existent directory should result in error
- HADOOP-9574.
Major bug reported by Jian He and fixed by Jian He
Add new methods in AbstractDelegationTokenSecretManager for restoring RMDelegationTokens on RMRestart
- HADOOP-9566.
Major bug reported by Lenni Kuff and fixed by Colin Patrick McCabe (native)
Performing direct read using libhdfs sometimes raises SIGPIPE (which in turn throws SIGABRT) causing client crashes
- HADOOP-9563.
Major bug reported by Kihwal Lee and fixed by Tian Hong Wang (util)
Fix incompatibility introduced by HADOOP-9523
- HADOOP-9560.
Minor improvement reported by Tsuyoshi OZAWA and fixed by Tsuyoshi OZAWA (metrics)
metrics2#JvmMetrics should have max memory size of JVM
- HADOOP-9556.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (ha , test)
disable HA tests on Windows that fail due to ZooKeeper client connection management bug
- HADOOP-9553.
Major bug reported by Arpit Agarwal and fixed by Arpit Agarwal (test)
TestAuthenticationToken fails on Windows
- HADOOP-9550.
Major bug reported by Karthik Kambatla and fixed by Karthik Kambatla
Remove aspectj dependency
- HADOOP-9549.
Blocker bug reported by Kihwal Lee and fixed by Daryn Sharp (security)
WebHdfsFileSystem hangs on close()
- HADOOP-9532.
Minor bug reported by Chris Nauroth and fixed by Chris Nauroth (bin)
HADOOP_CLIENT_OPTS is appended twice by Windows cmd scripts
- HADOOP-9526.
Major bug reported by Arpit Agarwal and fixed by Arpit Agarwal (test)
TestShellCommandFencer and TestShell fail on Windows
- HADOOP-9524.
Major bug reported by Arpit Agarwal and fixed by Arpit Agarwal (ha)
Fix ShellCommandFencer to work on Windows
- HADOOP-9523.
Major improvement reported by Tian Hong Wang and fixed by Tian Hong Wang
Provide a generic IBM java vendor flag in PlatformName.java to support non-Sun JREs
- HADOOP-9517.
Blocker bug reported by Arun C Murthy and fixed by Karthik Kambatla (documentation)
Document Hadoop Compatibility
- HADOOP-9515.
Major new feature reported by Brandon Li and fixed by Brandon Li
Add general interface for NFS and Mount
- HADOOP-9511.
Major improvement reported by Omkar Vinit Joshi and fixed by Omkar Vinit Joshi
Adding support for additional input streams (FSDataInputStream and RandomAccessFile) in SecureIOUtils.
- HADOOP-9509.
Major new feature reported by Brandon Li and fixed by Brandon Li
Implement ONCRPC and XDR
- HADOOP-9507.
Minor bug reported by Mostafa Elhemali and fixed by Chris Nauroth (fs)
LocalFileSystem rename() is broken in some cases when destination exists
- HADOOP-9504.
Critical bug reported by Liang Xie and fixed by Liang Xie (metrics)
MetricsDynamicMBeanBase has concurrency issues in createMBeanInfo
- HADOOP-9503.
Minor improvement reported by Varun Sharma and fixed by Varun Sharma (ipc)
Remove sleep between IPC client connect timeouts
- HADOOP-9500.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (test)
TestUserGroupInformation#testGetServerSideGroups fails on Windows due to failure to find winutils.exe
- HADOOP-9496.
Critical bug reported by Gopal V and fixed by Harsh J (bin)
Bad merge of HADOOP-9450 on branch-2 breaks all bin/hadoop calls that need HADOOP_CLASSPATH
- HADOOP-9490.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic (fs)
LocalFileSystem#reportChecksumFailure not closing the checksum file handle before rename
- HADOOP-9488.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (fs)
FileUtil#createJarWithClassPath only substitutes environment variables from current process environment/does not support overriding when launching new process
- HADOOP-9486.
Major bug reported by Vinod Kumar Vavilapalli and fixed by Chris Nauroth
Promote Windows and Shell related utils from YARN to Hadoop Common
- HADOOP-9485.
Minor bug reported by Colin Patrick McCabe and fixed by Colin Patrick McCabe (net)
No default value in the code for hadoop.rpc.socket.factory.class.default
- HADOOP-9483.
Major improvement reported by Chris Nauroth and fixed by Arpit Agarwal (util)
winutils support for readlink command
- HADOOP-9481.
Minor bug reported by Vadim Bondarev and fixed by Vadim Bondarev
Broken conditional logic with HADOOP_SNAPPY_LIBRARY
- HADOOP-9473.
Trivial bug reported by Glen Mazza and fixed by (fs)
typo in FileUtil copy() method
- HADOOP-9469.
Major bug reported by Thomas Graves and fixed by Robert Parker
mapreduce/yarn source jars not included in dist tarball
- HADOOP-9459.
Critical bug reported by Vinay and fixed by Vinay (ha)
ActiveStandbyElector can join the election before the service is HEALTHY, resulting in null data at ActiveBreadCrumb
- HADOOP-9455.
Minor bug reported by Sangjin Lee and fixed by Chris Nauroth (bin)
HADOOP_CLIENT_OPTS appended twice causes JVM failures
- HADOOP-9451.
Major bug reported by Junping Du and fixed by Junping Du (net)
Node with one topology layer should be handled as fault topology when NodeGroup layer is enabled
- HADOOP-9450.
Major improvement reported by Mitch Wyle and fixed by Harsh J (scripts)
HADOOP_USER_CLASSPATH_FIRST is not honored; CLASSPATH is PREpended instead of APpended
- HADOOP-9443.
Major bug reported by Chuan Liu and fixed by Chuan Liu
Port winutils static code analysis change to trunk
- HADOOP-9439.
Minor bug reported by Colin Patrick McCabe and fixed by Colin Patrick McCabe (native)
JniBasedUnixGroupsMapping: fix some crash bugs
- HADOOP-9437.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (test)
TestNativeIO#testRenameTo fails on Windows due to assumption that POSIX errno is embedded in NativeIOException
- HADOOP-9430.
Major bug reported by Amir Sanjar and fixed by (security)
TestSSLFactory fails on IBM JVM
- HADOOP-9429.
Major bug reported by Amir Sanjar and fixed by (test)
TestConfiguration fails with IBM JAVA
- HADOOP-9425.
Major sub-task reported by Sanjay Radia and fixed by Sanjay Radia (ipc)
Add error codes to rpc-response
- HADOOP-9421.
Blocker sub-task reported by Sanjay Radia and fixed by Daryn Sharp
Convert SASL to use ProtoBuf and provide negotiation capabilities
Raw SASL protocol now uses protobufs wrapped with RPC headers.
The negotiation sequence incorporates the state of the exchange.
The server now has the ability to advertise its supported auth types.
- HADOOP-9418.
Major sub-task reported by Andrew Wang and fixed by Andrew Wang (fs)
Add symlink resolution support to DistributedFileSystem
- HADOOP-9416.
Major sub-task reported by Andrew Wang and fixed by Andrew Wang (fs)
Add new symlink resolution methods in FileSystem and FileSystemLinkResolver
- HADOOP-9414.
Major sub-task reported by Andrew Wang and fixed by Andrew Wang (fs)
Refactor out FSLinkResolver and relevant helper methods
- HADOOP-9413.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic
Introduce common utils for File#setReadable/Writable/Executable and File#canRead/Write/Execute that work cross-platform
- HADOOP-9408.
Minor bug reported by rajeshbabu and fixed by rajeshbabu (conf)
misleading description for net.topology.table.file.name property in core-default.xml
- HADOOP-9407.
Major bug reported by Sangjin Lee and fixed by Sangjin Lee (build)
commons-daemon 1.0.3 dependency has bad group id causing build issues
- HADOOP-9405.
Minor bug reported by Andrew Wang and fixed by Andrew Wang (test , tools)
TestGridmixSummary#testExecutionSummarizer is broken
- HADOOP-9401.
Major improvement reported by Karthik Kambatla and fixed by Karthik Kambatla
CodecPool: Add counters for number of (de)compressors leased out
- HADOOP-9399.
Minor bug reported by Todd Lipcon and fixed by Konstantin Boudnik (build)
protoc maven plugin doesn't work on mvn 3.0.2
Committed to 2.0.4-alpha branch
- HADOOP-9397.
Major bug reported by Jason Lowe and fixed by Chris Nauroth (build)
Incremental dist tar build fails
- HADOOP-9388.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic
TestFsShellCopy fails on Windows
- HADOOP-9380.
Major sub-task reported by Sanjay Radia and fixed by Sanjay Radia (ipc)
Add totalLength to rpc response
- HADOOP-9379.
Trivial improvement reported by Arpit Gupta and fixed by Arpit Gupta
capture the ulimit info after printing the log to the console
- HADOOP-9376.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic
TestProxyUserFromEnv fails on a Windows domain joined machine
- HADOOP-9373.
Minor bug reported by Suresh Srinivas and fixed by Suresh Srinivas
Merge CHANGES.branch-trunk-win.txt to CHANGES.txt
- HADOOP-9369.
Major bug reported by Karthik Kambatla and fixed by Karthik Kambatla (net)
DNS#reverseDns() can return hostname with . appended at the end
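A reverse lookup can return a fully qualified name with a trailing root dot (for example "host.example.com."), which then fails string comparisons against the same name without the dot. A minimal sketch of the normalization, assuming a hypothetical helper name; it does not reproduce the actual DNS#reverseDns() change.

```java
public class TrailingDotSketch {
    /** Hypothetical helper: drop a single trailing '.' from a hostname, if present. */
    static String stripTrailingDot(String hostname) {
        if (hostname != null && hostname.endsWith(".")) {
            return hostname.substring(0, hostname.length() - 1);
        }
        return hostname;
    }

    public static void main(String[] args) {
        System.out.println(stripTrailingDot("host.example.com.")); // host.example.com
        System.out.println(stripTrailingDot("host.example.com"));  // host.example.com
    }
}
```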
- HADOOP-9365.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic
TestHAZKUtil fails on Windows
- HADOOP-9364.
Major bug reported by Ivan Mitic and fixed by Ivan Mitic
PathData#expandAsGlob does not return correct results for absolute paths on Windows
- HADOOP-9358.
Major bug reported by Todd Lipcon and fixed by Todd Lipcon (ipc , security)
"Auth failed" log should include exception string
- HADOOP-9355.
Major sub-task reported by Andrew Wang and fixed by Andrew Wang (fs)
Abstract symlink tests to use either FileContext or FileSystem
- HADOOP-9353.
Major bug reported by Arpit Agarwal and fixed by Arpit Agarwal (build)
Activate native-win profile by default on Windows
- HADOOP-9352.
Major improvement reported by Daryn Sharp and fixed by Daryn Sharp (security)
Expose UGI.setLoginUser for tests
- HADOOP-9349.
Major bug reported by Sandy Ryza and fixed by Sandy Ryza (tools)
Confusing output when running hadoop version from one hadoop installation when HADOOP_HOME points to another
- HADOOP-9343.
Major improvement reported by Siddharth Seth and fixed by Siddharth Seth
Allow additional exceptions through the RPC layer
- HADOOP-9342.
Major bug reported by Thomas Weise and fixed by Thomas Weise (build)
Remove jline from distribution
- HADOOP-9339.
Major bug reported by Daryn Sharp and fixed by Daryn Sharp (ipc)
IPC.Server incorrectly sets UGI auth type
- HADOOP-9338.
Major new feature reported by Nick White and fixed by Nick White (fs)
FsShell Copy Commands Should Optionally Preserve File Attributes
- HADOOP-9337.
Major bug reported by Ivan A. Veselovsky and fixed by Ivan A. Veselovsky
org.apache.hadoop.fs.DF.getMount() does not work on Mac OS
- HADOOP-9336.
Critical improvement reported by Daryn Sharp and fixed by Daryn Sharp (ipc)
Allow UGI of current connection to be queried
- HADOOP-9334.
Minor improvement reported by Nicolas Liochon and fixed by Nicolas Liochon (build)
Update netty version
- HADOOP-9323.
Minor bug reported by Hao Zhong and fixed by Suresh Srinivas (documentation , fs , io , record)
Typos in API documentation
- HADOOP-9322.
Minor improvement reported by Harsh J and fixed by Harsh J (security)
LdapGroupsMapping doesn't seem to set a timeout for its directory search
- HADOOP-9318.
Minor improvement reported by Colin Patrick McCabe and fixed by Colin Patrick McCabe
when exiting on a signal, print the signal name first
- HADOOP-9307.
Major bug reported by Todd Lipcon and fixed by Todd Lipcon (fs)
BufferedFSInputStream.read returns wrong results after certain seeks
- HADOOP-9305.
Major bug reported by Aaron T. Myers and fixed by Aaron T. Myers (security)
Add support for running the Hadoop client on 64-bit AIX
- HADOOP-9304.
Major bug reported by Alejandro Abdelnur and fixed by Alejandro Abdelnur (build)
remove addition of avro generated-sources dirs to build
- HADOOP-9303.
Major bug reported by Thomas Graves and fixed by Andy Isaacson
command manual dfsadmin missing entry for restoreFailedStorage option
- HADOOP-9302.
Major bug reported by Thomas Graves and fixed by Andy Isaacson (documentation)
HDFS docs not linked from top level
- HADOOP-9299.
Blocker bug reported by Roman Shaposhnik and fixed by Daryn Sharp (security)
kerberos name resolution is kicking in even when kerberos is not configured
- HADOOP-9297.
Major bug reported by Alejandro Abdelnur and fixed by Alejandro Abdelnur
remove old record IO generation and tests
- HADOOP-9294.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (test)
GetGroupsTestBase fails on Windows
- HADOOP-9290.
Major bug reported by Arpit Agarwal and fixed by Chris Nauroth (build , native)
Some tests cannot load native library
- HADOOP-9287.
Major test reported by Tsuyoshi OZAWA and fixed by Andrey Klochkov (test)
Parallel testing hadoop-common
- HADOOP-9283.
Major new feature reported by Aaron T. Myers and fixed by Aaron T. Myers (security)
Add support for running the Hadoop client on AIX
- HADOOP-9279.
Major improvement reported by Tsuyoshi OZAWA and fixed by Tsuyoshi OZAWA (build , documentation)
Document the need to build hadoop-maven-plugins for eclipse and separate project builds
- HADOOP-9267.
Minor bug reported by Andrew Wang and fixed by Andrew Wang
hadoop -help, -h, --help should show usage instructions
- HADOOP-9264.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (fs)
port change to use Java untar API on Windows from branch-1-win to trunk
- HADOOP-9253.
Major improvement reported by Arpit Gupta and fixed by Arpit Gupta
Capture ulimit info in the logs at service start time
- HADOOP-9246.
Major bug reported by Karthik Kambatla and fixed by Karthik Kambatla (build)
Execution phase for hadoop-maven-plugin should be process-resources
- HADOOP-9245.
Major bug reported by Karthik Kambatla and fixed by Karthik Kambatla (build)
mvn clean without running mvn install before fails
- HADOOP-9233.
Major test reported by Vadim Bondarev and fixed by Vadim Bondarev
Cover package org.apache.hadoop.io.compress.zlib with unit tests
- HADOOP-9230.
Major bug reported by Karthik Kambatla and fixed by Karthik Kambatla (test)
TestUniformSizeInputFormat fails intermittently
- HADOOP-9222.
Major test reported by Vadim Bondarev and fixed by Vadim Bondarev
Cover package org.apache.hadoop.io.lz4 with unit tests
- HADOOP-9220.
Critical bug reported by Tom White and fixed by Tom White (ha)
Unnecessary transition to standby in ActiveStandbyElector
- HADOOP-9218.
Major sub-task reported by Sanjay Radia and fixed by Sanjay Radia (ipc)
Document the Rpc-wrappers used internally
- HADOOP-9211.
Major bug reported by Sarah Weissman and fixed by Plamen Jeliazkov (conf)
HADOOP_CLIENT_OPTS default setting fixes max heap size at 128m, disregards HADOOP_HEAPSIZE
- HADOOP-9209.
Major new feature reported by Todd Lipcon and fixed by Todd Lipcon (fs , tools)
Add shell command to dump file checksums
- HADOOP-9164.
Minor improvement reported by Binglin Chang and fixed by Binglin Chang (native)
Print paths of loaded native libraries in NativeLibraryChecker
- HADOOP-9163.
Major sub-task reported by Sanjay Radia and fixed by Sanjay Radia (ipc)
The rpc msg in ProtobufRpcEngine.proto should be moved out to avoid an extra copy
- HADOOP-9154.
Major bug reported by Karthik Kambatla and fixed by Karthik Kambatla (io)
SortedMapWritable#putAll() doesn't add key/value classes to the map
- HADOOP-9151.
Major sub-task reported by Sanjay Radia and fixed by Sanjay Radia (ipc)
Include RPC error info in RpcResponseHeader instead of sending it separately
- HADOOP-9150.
Critical bug reported by Todd Lipcon and fixed by Todd Lipcon (fs/s3 , ha , performance , viewfs)
Unnecessary DNS resolution attempts for logical URIs
- HADOOP-9140.
Major sub-task reported by Sanjay Radia and fixed by Sanjay Radia (ipc)
Cleanup rpc PB protos
- HADOOP-9131.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (test)
TestLocalFileSystem#testListStatusWithColons cannot run on Windows
- HADOOP-9125.
Major bug reported by Kai Zheng and fixed by Kai Zheng (security)
LdapGroupsMapping threw CommunicationException after some idle time
- HADOOP-9117.
Major improvement reported by Alejandro Abdelnur and fixed by Alejandro Abdelnur (build)
replace protoc ant plugin exec with a maven plugin
- HADOOP-9043.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (util)
disallow creating symlinks with forward slashes in winutils
- HADOOP-8982.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (net)
TestSocketIOWithTimeout fails on Windows
- HADOOP-8973.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (util)
DiskChecker cannot reliably detect an inaccessible disk on Windows with NTFS ACLs
- HADOOP-8958.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (viewfs)
ViewFs:Non absolute mount name failures when running multiple tests on Windows
- HADOOP-8957.
Major bug reported by Chris Nauroth and fixed by Chris Nauroth (fs)
AbstractFileSystem#IsValidName should be overridden for embedded file systems like ViewFs
- HADOOP-8924.
Major improvement reported by Chris Nauroth and fixed by Chris Nauroth (build)
Add maven plugin alternative to shell script to save package-info.java
- HADOOP-8917.
Major bug reported by Arpit Gupta and fixed by Arpit Gupta
add LOCALE.US to toLowerCase in SecurityUtil.replacePattern
- HADOOP-8886.
Major improvement reported by Eli Collins and fixed by Eli Collins (fs)
Remove KFS support
Kosmos FS (KFS) is no longer maintained and Hadoop support has been removed. KFS has been replaced by QFS (HADOOP-8885).
- HADOOP-8711.
Major improvement reported by Brandon Li and fixed by Brandon Li (ipc)
provide an option for IPC server users to avoid printing stack information for certain exceptions
- HADOOP-8569.
Minor bug reported by Colin Patrick McCabe and fixed by Colin Patrick McCabe
CMakeLists.txt: define _GNU_SOURCE and _LARGEFILE_SOURCE
- HADOOP-8562.
Major new feature reported by Bikas Saha and fixed by Bikas Saha
Enhancements to support Hadoop on Windows Server and Windows Azure environments
This umbrella JIRA makes enhancements to support Hadoop natively on Windows Server and Windows Azure environments.
- HADOOP-8470.
Major sub-task reported by Junping Du and fixed by Junping Du
Implementation of 4-layer subclass of NetworkTopology (NetworkTopologyWithNodeGroup)
This patch should be checked in together with (or after) HADOOP-8469: https://issues.apache.org/jira/browse/HADOOP-8469
- HADOOP-8469.
Major sub-task reported by Junping Du and fixed by Junping Du
Make NetworkTopology class pluggable
- HADOOP-8462.
Major improvement reported by Govind Kamat and fixed by Govind Kamat (io)
Native-code implementation of bzip2 codec
- HADOOP-8440.
Minor bug reported by Ivan Mitic and fixed by Ivan Mitic (fs)
HarFileSystem.decodeHarURI fails for URIs whose host contains numbers
- HADOOP-8415.
Minor improvement reported by Jan van der Lugt and fixed by Jan van der Lugt (conf)
getDouble() and setDouble() in org.apache.hadoop.conf.Configuration
- HADOOP-7487.
Major bug reported by Todd Lipcon and fixed by Andrew Wang (fs)
DF should throw a more reasonable exception when mount cannot be determined
- HADOOP-7391.
Major bug reported by Sanjay Radia and fixed by Sanjay Radia
Document Interface Classification from HADOOP-5073