<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--
| Generated by Apache Maven Doxia at 2023-03-25
| Rendered using Apache Maven Stylus Skin 1.5
-->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Apache Hadoop 3.4.0-SNAPSHOT &#x2013; The Manifest Committer for Azure and Google Cloud Storage</title>
<style type="text/css" media="all">
@import url("./css/maven-base.css");
@import url("./css/maven-theme.css");
@import url("./css/site.css");
</style>
<link rel="stylesheet" href="./css/print.css" type="text/css" media="print" />
<meta name="Date-Revision-yyyymmdd" content="20230325" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body class="composite">
<div id="banner">
<a href="http://hadoop.apache.org/" id="bannerLeft">
<img src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="" />
</a>
<a href="http://www.apache.org/" id="bannerRight">
<img src="http://www.apache.org/images/asf_logo_wide.png" alt="" />
</a>
<div class="clear">
<hr/>
</div>
</div>
<div id="breadcrumbs">
<div class="xright"> <a href="http://wiki.apache.org/hadoop" class="externalLink">Wiki</a>
|
<a href="https://gitbox.apache.org/repos/asf/hadoop.git" class="externalLink">git</a>
|
<a href="http://hadoop.apache.org/" class="externalLink">Apache Hadoop</a>
&nbsp;| Last Published: 2023-03-25
&nbsp;| Version: 3.4.0-SNAPSHOT
</div>
<div class="clear">
<hr/>
</div>
</div>
<div id="leftColumn">
<div id="navcolumn">
<h5>General</h5>
<ul>
<li class="none">
<a href="../../index.html">Overview</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/SingleCluster.html">Single Node Setup</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/ClusterSetup.html">Cluster Setup</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/CommandsManual.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/FileSystemShell.html">FileSystem Shell</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/Compatibility.html">Compatibility Specification</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/DownstreamDev.html">Downstream Developer's Guide</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/AdminCompatibilityGuide.html">Admin Compatibility Guide</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/InterfaceClassification.html">Interface Classification</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/filesystem/index.html">FileSystem Specification</a>
</li>
</ul>
<h5>Common</h5>
<ul>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/CLIMiniCluster.html">CLI Mini Cluster</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/FairCallQueue.html">Fair Call Queue</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/NativeLibraries.html">Native Libraries</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/Superusers.html">Proxy User</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/RackAwareness.html">Rack Awareness</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/SecureMode.html">Secure Mode</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/ServiceLevelAuth.html">Service Level Authorization</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/HttpAuthentication.html">HTTP Authentication</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/CredentialProviderAPI.html">Credential Provider API</a>
</li>
<li class="none">
<a href="../../hadoop-kms/index.html">Hadoop KMS</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/Tracing.html">Tracing</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/UnixShellGuide.html">Unix Shell Guide</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/registry/index.html">Registry</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/AsyncProfilerServlet.html">Async Profiler</a>
</li>
</ul>
<h5>HDFS</h5>
<ul>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsDesign.html">Architecture</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html">User Guide</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HDFSCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html">NameNode HA With QJM</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html">NameNode HA With NFS</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ObserverNameNode.html">Observer NameNode</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/Federation.html">Federation</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ViewFs.html">ViewFs</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ViewFsOverloadScheme.html">ViewFsOverloadScheme</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html">Snapshots</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsEditsViewer.html">Edits Viewer</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html">Image Viewer</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html">Permissions and HDFS</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html">Quotas and HDFS</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/LibHdfs.html">libhdfs (C API)</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/WebHDFS.html">WebHDFS (REST API)</a>
</li>
<li class="none">
<a href="../../hadoop-hdfs-httpfs/index.html">HttpFS</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html">Short Circuit Local Reads</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html">Centralized Cache Management</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html">NFS Gateway</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html">Rolling Upgrade</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ExtendedAttributes.html">Extended Attributes</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html">Transparent Encryption</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html">Multihoming</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html">Storage Policies</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/MemoryStorage.html">Memory Storage Support</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/SLGUserGuide.html">Synthetic Load Generator</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html">Erasure Coding</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html">Disk Balancer</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsUpgradeDomain.html">Upgrade Domain</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsDataNodeAdminGuide.html">DataNode Admin</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs-rbf/HDFSRouterFederation.html">Router Federation</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsProvidedStorage.html">Provided Storage</a>
</li>
</ul>
<h5>MapReduce</h5>
<ul>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html">Tutorial</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html">Compatibility with 1.x</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html">Encrypted Shuffle</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html">Pluggable Shuffle/Sort</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html">Distributed Cache Deploy</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/SharedCacheSupport.html">Support for YARN Shared Cache</a>
</li>
</ul>
<h5>MapReduce REST APIs</h5>
<ul>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredAppMasterRest.html">MR Application Master</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html">MR History Server</a>
</li>
</ul>
<h5>YARN</h5>
<ul>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/YARN.html">Architecture</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/YarnCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html">Capacity Scheduler</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/FairScheduler.html">Fair Scheduler</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html">ResourceManager Restart</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html">ResourceManager HA</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/ResourceModel.html">Resource Model</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/NodeLabel.html">Node Labels</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/NodeAttributes.html">Node Attributes</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html">Web Application Proxy</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/TimelineServer.html">Timeline Server</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html">Timeline Service V.2</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html">Writing YARN Applications</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/YarnApplicationSecurity.html">YARN Application Security</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/NodeManager.html">NodeManager</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/DockerContainers.html">Running Applications in Docker Containers</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/RuncContainers.html">Running Applications in runC Containers</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html">Using CGroups</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/SecureContainer.html">Secure Containers</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/ReservationSystem.html">Reservation System</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/GracefulDecommission.html">Graceful Decommission</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/OpportunisticContainers.html">Opportunistic Containers</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/Federation.html">YARN Federation</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/SharedCache.html">Shared Cache</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/UsingGpus.html">Using GPU</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/UsingFPGA.html">Using FPGA</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/PlacementConstraints.html">Placement Constraints</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/YarnUI2.html">YARN UI2</a>
</li>
</ul>
<h5>YARN REST APIs</h5>
<ul>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html">Introduction</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html">Resource Manager</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/NodeManagerRest.html">Node Manager</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/TimelineServer.html#Timeline_Server_REST_API_v1">Timeline Server</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html#Timeline_Service_v.2_REST_API">Timeline Service V.2</a>
</li>
</ul>
<h5>YARN Service</h5>
<ul>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/Overview.html">Overview</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/QuickStart.html">QuickStart</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/Concepts.html">Concepts</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/YarnServiceAPI.html">Yarn Service API</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/ServiceDiscovery.html">Service Discovery</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/SystemServices.html">System Services</a>
</li>
</ul>
<h5>Hadoop Compatible File Systems</h5>
<ul>
<li class="none">
<a href="../../hadoop-aliyun/tools/hadoop-aliyun/index.html">Aliyun OSS</a>
</li>
<li class="none">
<a href="../../hadoop-aws/tools/hadoop-aws/index.html">Amazon S3</a>
</li>
<li class="none">
<a href="../../hadoop-azure/index.html">Azure Blob Storage</a>
</li>
<li class="none">
<a href="../../hadoop-azure-datalake/index.html">Azure Data Lake Storage</a>
</li>
<li class="none">
<a href="../../hadoop-cos/cloud-storage/index.html">Tencent COS</a>
</li>
<li class="none">
<a href="../../hadoop-huaweicloud/cloud-storage/index.html">Huaweicloud OBS</a>
</li>
</ul>
<h5>Auth</h5>
<ul>
<li class="none">
<a href="../../hadoop-auth/index.html">Overview</a>
</li>
<li class="none">
<a href="../../hadoop-auth/Examples.html">Examples</a>
</li>
<li class="none">
<a href="../../hadoop-auth/Configuration.html">Configuration</a>
</li>
<li class="none">
<a href="../../hadoop-auth/BuildingIt.html">Building</a>
</li>
</ul>
<h5>Tools</h5>
<ul>
<li class="none">
<a href="../../hadoop-streaming/HadoopStreaming.html">Hadoop Streaming</a>
</li>
<li class="none">
<a href="../../hadoop-archives/HadoopArchives.html">Hadoop Archives</a>
</li>
<li class="none">
<a href="../../hadoop-archive-logs/HadoopArchiveLogs.html">Hadoop Archive Logs</a>
</li>
<li class="none">
<a href="../../hadoop-distcp/DistCp.html">DistCp</a>
</li>
<li class="none">
<a href="../../hadoop-federation-balance/HDFSFederationBalance.html">HDFS Federation Balance</a>
</li>
<li class="none">
<a href="../../hadoop-gridmix/GridMix.html">GridMix</a>
</li>
<li class="none">
<a href="../../hadoop-rumen/Rumen.html">Rumen</a>
</li>
<li class="none">
<a href="../../hadoop-resourceestimator/ResourceEstimator.html">Resource Estimator Service</a>
</li>
<li class="none">
<a href="../../hadoop-sls/SchedulerLoadSimulator.html">Scheduler Load Simulator</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/Benchmarking.html">Hadoop Benchmarking</a>
</li>
<li class="none">
<a href="../../hadoop-dynamometer/Dynamometer.html">Dynamometer</a>
</li>
</ul>
<h5>Reference</h5>
<ul>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/release/">Changelog and Release Notes</a>
</li>
<li class="none">
<a href="../../api/index.html">Java API docs</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/UnixShellAPI.html">Unix Shell API</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/Metrics.html">Metrics</a>
</li>
</ul>
<h5>Configuration</h5>
<ul>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/core-default.xml">core-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/hdfs-default.xml">hdfs-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs-rbf/hdfs-rbf-default.xml">hdfs-rbf-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml">mapred-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-common/yarn-default.xml">yarn-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-kms/kms-default.html">kms-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-hdfs-httpfs/httpfs-default.html">httpfs-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/DeprecatedProperties.html">Deprecated Properties</a>
</li>
</ul>
<a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy">
<img alt="Built by Maven" src="./images/logos/maven-feather.png"/>
</a>
</div>
</div>
<div id="bodyColumn">
<div id="contentBox">
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<h1>The Manifest Committer for Azure and Google Cloud Storage</h1>
<p>This document describes how to use the <i>Manifest Committer</i>.</p>
<p>The <i>Manifest</i> committer is a committer for work which provides performance on ABFS for &#x201c;real world&#x201d; queries, and performance and correctness on GCS. It also works with other filesystems, including HDFS. However, the design is optimized for object stores where listing operations are slow and expensive.</p>
<p>The architecture and implementation of the committer is covered in <a href="manifest_committer_architecture.html">Manifest Committer Architecture</a>.</p>
<p>The protocol and its correctness are covered in <a href="manifest_committer_protocol.html">Manifest Committer Protocol</a>.</p>
<p>It was added in March 2022, and should be considered unstable in early releases.</p>
<ul>
<li><a href="#Problem:">Problem:</a></li>
<li><a href="#Solution.">Solution.</a></li>
<li><a href="#Binding_to_the_manifest_committer_in_Spark.">Binding to the manifest committer in Spark.</a>
<ul>
<li><a href="#Using_the_Cloudstore_committerinfo_command_to_probe_committer_bindings."> Using the Cloudstore committerinfo command to probe committer bindings.</a></li></ul></li>
<li><a href="#Verifying_that_the_committer_was_used">Verifying that the committer was used</a></li>
<li><a href="#Scaling_jobs_mapreduce.manifest.committer.io.threads"> Scaling jobs mapreduce.manifest.committer.io.threads</a></li>
<li><a href="#Optional:_deleting_target_files_in_Job_Commit"> Optional: deleting target files in Job Commit</a></li>
<li><a href="#Viewing__SUCCESS_file_files_through_the_ManifestPrinter_tool."> Viewing _SUCCESS files through the ManifestPrinter tool.</a></li>
<li><a href="#Collecting_Job_Summaries_mapreduce.manifest.committer.summary.report.directory"> Collecting Job Summaries mapreduce.manifest.committer.summary.report.directory</a></li>
<li><a href="#Full_set_of_ABFS_options_for_spark"> Full set of ABFS options for spark</a></li>
<li><a href="#Experimental:_ABFS_Rename_Rate_Limiting_fs.azure.io.rate.limit">Experimental: ABFS Rename Rate Limiting fs.azure.io.rate.limit</a></li>
<li><a href="#Advanced_Configuration_options">Advanced Configuration options</a></li>
<li><a href="#Validating_output__mapreduce.manifest.committer.validate.output">Validating output mapreduce.manifest.committer.validate.output</a></li>
<li><a href="#Controlling_storage_integration_mapreduce.manifest.committer.store.operations.classname">Controlling storage integration mapreduce.manifest.committer.store.operations.classname</a></li>
<li><a href="#Support_for_concurrent_jobs_to_the_same_directory"> Support for concurrent jobs to the same directory</a></li></ul>
<section>
<h2><a name="Problem:"></a>Problem:</h2>
<p>The only committer of work from Spark to Azure ADLS Gen 2 &#x201c;<a class="externalLink" href="abfs://">abfs://</a>&#x201d; storage which is safe to use is the &#x201c;v1 file committer&#x201d;.</p>
<p>This is &#x201c;correct&#x201d; in that if a task attempt fails, its output is guaranteed not to be included in the final output. The &#x201c;v2&#x201d; commit algorithm cannot meet that guarantee, which is why it is no longer the default.</p>
<p>But: it is slow, especially on jobs where deep directory trees of output are used. Why is it slow? It&#x2019;s hard to point at a particular cause, primarily because of the lack of any instrumentation in the <code>FileOutputCommitter</code>. Stack traces of running jobs generally show <code>rename()</code>, though list operations do surface too.</p>
<p>On Google GCS, neither the v1 nor the v2 algorithm is <i>safe</i>, because the Google Cloud Storage filesystem doesn&#x2019;t have the atomic directory rename which the v1 algorithm requires.</p>
<p>A further issue is that both Azure and GCS storage may encounter scale issues with deleting directories with many descendants. This can trigger timeouts because the FileOutputCommitter assumes that cleaning up after the job is a fast call to <code>delete(&quot;_temporary&quot;, true)</code>.</p></section><section>
<h2><a name="Solution."></a>Solution.</h2>
<p>The <i>Intermediate Manifest</i> committer is a new committer for work which should deliver performance on ABFS for &#x201c;real world&#x201d; queries, and performance and correctness on GCS.</p>
<p>This committer uses the extension point which came in for the S3A committers. Users can declare a new committer factory for <a class="externalLink" href="abfs://">abfs://</a> and <a class="externalLink" href="gcs://">gcs://</a> URLs. A suitably configured spark deployment will pick up the new committer.</p>
<p>Directory performance issues in job cleanup can be addressed by two options:</p>
<ol style="list-style-type: decimal">
<li>The committer will parallelize deletion of task attempt directories before deleting the <code>_temporary</code> directory.</li>
<li>Cleanup can be disabled.</li>
</ol>
<p>The committer can be used with any filesystem client which has a &#x201c;real&#x201d; file <code>rename()</code> operation. It has been optimised for remote object stores where listing and file probes are expensive; the design is less likely to offer such a significant speedup on HDFS, though the parallel rename operations will still speed up jobs there compared to the classic v1 algorithm.</p>
<h1><a name="how"></a> How it works</h1>
<p>The full details are covered in <a href="manifest_committer_architecture.html">Manifest Committer Architecture</a>.</p>
<h1><a name="use"></a> Using the committer</h1>
<p>The hooks put in to support the S3A committers were designed to allow every filesystem schema to provide their own committer. See <a href="../../hadoop-aws/tools/hadoop-aws/committers.html#Switching_to_an_S3A_Committer">Switching To an S3A Committer</a></p>
<p>A factory for the abfs schema would be defined in <code>mapreduce.outputcommitter.factory.scheme.abfs</code> ; and a similar one for <code>gcs</code>.</p>
<p>Some matching Spark configuration changes, especially for Parquet binding, will be required. These can be done in <code>core-site.xml</code>, if they are not defined in the <code>mapred-default.xml</code> JAR.</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;mapreduce.outputcommitter.factory.scheme.abfs&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;mapreduce.outputcommitter.factory.scheme.gs&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
</section><section>
<h2><a name="Binding_to_the_manifest_committer_in_Spark."></a>Binding to the manifest committer in Spark.</h2>
<p>In Apache Spark, the configuration can be done either with command line options (after <code>--conf</code>) or by using the <code>spark-defaults.conf</code> file. The following is an example of using <code>spark-defaults.conf</code>, also including the configuration for Parquet with a subclass of the parquet committer which uses the factory mechanism internally.</p>
<div class="source">
<div class="source">
<pre>spark.hadoop.mapreduce.outputcommitter.factory.scheme.abfs org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory
spark.hadoop.mapreduce.outputcommitter.factory.scheme.gs org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
</pre></div></div>
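<p>The same settings can equally be passed on the <code>spark-submit</code> command line with <code>--conf</code>. The following is an illustrative sketch only; the application JAR and main class names are placeholders.</p>
<div class="source">
<div class="source">
<pre># pass the committer bindings as individual --conf options (placeholders: example.MyJob, my-job.jar)
spark-submit \
  --conf spark.hadoop.mapreduce.outputcommitter.factory.scheme.abfs=org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory \
  --conf spark.hadoop.mapreduce.outputcommitter.factory.scheme.gs=org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory \
  --conf spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter \
  --conf spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol \
  --class example.MyJob \
  my-job.jar
</pre></div></div>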
<section>
<h3><a name="Using_the_Cloudstore_committerinfo_command_to_probe_committer_bindings."></a><a name="committerinfo"></a> Using the Cloudstore <code>committerinfo</code> command to probe committer bindings.</h3>
<p>The hadoop committer settings can be validated in a recent build of <a class="externalLink" href="https://github.com/steveloughran/cloudstore">cloudstore</a> and its <code>committerinfo</code> command. This command instantiates a committer for that path through the same factory mechanism as MR and spark jobs use, then prints its <code>toString</code> value.</p>
<div class="source">
<div class="source">
<pre>hadoop jar cloudstore-1.0.jar committerinfo abfs://testing@ukwest.dfs.core.windows.net/
2021-09-16 19:42:59,731 [main] INFO commands.CommitterInfo (StoreDurationInfo.java:&lt;init&gt;(53)) - Starting: Create committer
Committer factory for path abfs://testing@ukwest.dfs.core.windows.net/ is
org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory@3315d2d7
(classname org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory)
2021-09-16 19:43:00,897 [main] INFO manifest.ManifestCommitter (ManifestCommitter.java:&lt;init&gt;(144)) - Created ManifestCommitter with
JobID job__0000, Task Attempt attempt__0000_r_000000_1 and destination abfs://testing@ukwest.dfs.core.windows.net/
Created committer of class org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitter:
ManifestCommitter{ManifestCommitterConfig{destinationDir=abfs://testing@ukwest.dfs.core.windows.net/,
role='task committer',
taskAttemptDir=abfs://testing@ukwest.dfs.core.windows.net/_temporary/manifest_job__0000/0/_temporary/attempt__0000_r_000000_1,
createJobMarker=true,
jobUniqueId='job__0000',
jobUniqueIdSource='JobID',
jobAttemptNumber=0,
jobAttemptId='job__0000_0',
taskId='task__0000_r_000000',
taskAttemptId='attempt__0000_r_000000_1'},
iostatistics=counters=();
gauges=();
minimums=();
maximums=();
means=();
}
</pre></div></div>
</section></section><section>
<h2><a name="Verifying_that_the_committer_was_used"></a>Verifying that the committer was used</h2>
<p>The new committer will write a JSON summary of the operation, including statistics, in the <code>_SUCCESS</code> file.</p>
<p>If this file exists and is zero bytes long: the classic <code>FileOutputCommitter</code> was used.</p>
<p>If this file exists and is greater than zero bytes long, either the manifest committer was used, or in the case of S3A filesystems, one of the S3A committers. They all use the same JSON format.</p>
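<p>A quick way to check which committer ran, therefore, is to probe the length and the first bytes of the marker file. This is a sketch; the destination path below is a placeholder for the job&#x2019;s actual output directory.</p>
<div class="source">
<div class="source">
<pre># a zero-byte file means the classic FileOutputCommitter created it
hadoop fs -ls abfs://container@account.dfs.core.windows.net/output/_SUCCESS

# JSON content means the manifest committer (or an S3A committer) was used
hadoop fs -cat abfs://container@account.dfs.core.windows.net/output/_SUCCESS | head
</pre></div></div>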
<h1><a name="configuration"></a> Configuration options</h1>
<p>Here are the main configuration options of the committer.</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th> Option </th>
<th> Meaning </th>
<th> Default Value </th></tr>
</thead><tbody>
<tr class="b">
<td> <code>mapreduce.manifest.committer.delete.target.files</code> </td>
<td> Delete target files? </td>
<td> <code>false</code> </td></tr>
<tr class="a">
<td> <code>mapreduce.manifest.committer.io.threads</code> </td>
<td> Thread count for parallel operations </td>
<td> <code>64</code> </td></tr>
<tr class="b">
<td> <code>mapreduce.manifest.committer.summary.report.directory</code> </td>
<td> directory to save reports. </td>
<td> <code>&quot;&quot;</code> </td></tr>
<tr class="a">
<td> <code>mapreduce.manifest.committer.cleanup.parallel.delete</code> </td>
<td> Delete temporary directories in parallel </td>
<td> <code>true</code> </td></tr>
<tr class="b">
<td> <code>mapreduce.fileoutputcommitter.cleanup.skipped</code> </td>
<td> Skip cleanup of <code>_temporary</code> directory</td>
<td> <code>false</code> </td></tr>
<tr class="a">
<td> <code>mapreduce.fileoutputcommitter.cleanup-failures.ignored</code> </td>
<td> Ignore errors during cleanup </td>
<td> <code>false</code> </td></tr>
<tr class="b">
<td> <code>mapreduce.fileoutputcommitter.marksuccessfuljobs</code> </td>
<td> Create a <code>_SUCCESS</code> marker file on successful completion. (and delete any existing one in job setup) </td>
<td> <code>true</code> </td></tr>
</tbody>
</table>
<p>There are some more, as covered in the <a href="#Advanced_Configuration_options">Advanced</a> section.</p></section><section>
<h2><a name="Scaling_jobs_mapreduce.manifest.committer.io.threads"></a><a name="scaling"></a> Scaling jobs <code>mapreduce.manifest.committer.io.threads</code></h2>
<p>The core reason that this committer is faster than the classic <code>FileOutputCommitter</code> is that it tries to parallelize as much file IO as it can during job commit, specifically:</p>
<ul>
<li>task manifest loading</li>
<li>deletion of files where directories will be created</li>
<li>directory creation</li>
<li>file-by-file renaming</li>
<li>deletion of task attempt directories in job cleanup</li>
</ul>
<p>These operations are all performed in the same thread pool, whose size is set in the option <code>mapreduce.manifest.committer.io.threads</code>.</p>
<p>Larger values may be used.</p>
<p>XML</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;mapreduce.manifest.committer.io.threads&lt;/name&gt;
&lt;value&gt;200&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>spark-defaults.conf</p>
<div class="source">
<div class="source">
<pre>spark.hadoop.mapreduce.manifest.committer.io.threads 200
</pre></div></div>
<p>A larger value than that of the number of cores allocated to the MapReduce AM or Spark Driver does not directly overload the CPUs, as the threads are normally waiting for (slow) IO against the object store/filesystem to complete.</p>
<p>Caveats</p>
<ul>
<li>In Spark, multiple jobs may be committed in the same process, each of which will create its own thread pool during job commit or cleanup.</li>
<li>Azure rate throttling may be triggered if too many IO requests are made against the store. The rate throttling option <code>mapreduce.manifest.committer.io.rate</code> can help avoid this.</li>
</ul></section><section>
<h2><a name="Optional:_deleting_target_files_in_Job_Commit"></a><a name="deleting"></a> Optional: deleting target files in Job Commit</h2>
<p>The classic <code>FileOutputCommitter</code> deletes files at the destination paths before renaming the job&#x2019;s files into place.</p>
<p>This is optional in the manifest committers, set in the option <code>mapreduce.manifest.committer.delete.target.files</code> with a default value of <code>false</code>.</p>
<p>This increases performance and is safe to use when all files created by a job have unique filenames.</p>
<p>Apache Spark does generate unique filenames for ORC and Parquet since <a class="externalLink" href="https://issues.apache.org/jira/browse/SPARK-8406">SPARK-8406</a> <i>Adding UUID to output file name to avoid accidental overwriting</i>.</p>
<p>Avoiding checks for/deletion of target files saves one delete call per file being committed, so it can save a significant amount of store IO.</p>
<p>When appending to existing tables using formats other than ORC and Parquet, enable deletion of the target files unless you are confident that unique identifiers are added to each filename.</p>
<div class="source">
<div class="source">
<pre>spark.hadoop.mapreduce.manifest.committer.delete.target.files true
</pre></div></div>
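<p>The equivalent XML configuration:</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
  &lt;name&gt;mapreduce.manifest.committer.delete.target.files&lt;/name&gt;
  &lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>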
<p><i>Note 1:</i> the committer will skip deletion operations when it created the directory into which a file is to be renamed. This makes it slightly more efficient, at least if jobs appending data are creating and writing into new partitions.</p>
<p><i>Note 2:</i> the committer still requires tasks within a single job to create unique files. This is foundational for any job to generate correct data.</p>
<h1><a name="dynamic"></a> Spark Dynamic Partition overwriting</h1>
<p>Spark has a feature called &#x201c;Dynamic Partition Overwrites&#x201d;.</p>
<p>This can be initiated in SQL:</p>
<div class="source">
<div class="source">
<pre>INSERT OVERWRITE TABLE ...
</pre></div></div>
<p>Or through <code>DataSet</code> writes where the mode is <code>overwrite</code> and the partitioning matches that of the existing table:</p>
<div class="source">
<div class="source">
<pre>sparkConf.set(&quot;spark.sql.sources.partitionOverwriteMode&quot;, &quot;dynamic&quot;)
// followed by an overwrite of a Dataset into an existing partitioned table.
eventData2
.write
.mode(&quot;overwrite&quot;)
.partitionBy(&quot;year&quot;, &quot;month&quot;)
.format(&quot;parquet&quot;)
.save(existingDir)
</pre></div></div>
<p>This feature is implemented in Spark, which:</p>
<ol>
<li>Directs the job to write its new data to a temporary directory.</li>
<li>After job commit completes, scans the output to identify the leaf directories (&#x201c;partitions&#x201d;) into which data was written.</li>
<li>Deletes the content of those directories in the destination table.</li>
<li>Renames the new files into the partitions.</li>
</ol>
<p>This is all done in Spark, which takes over the tasks of scanning the intermediate output tree, deleting partitions and renaming the new files.</p>
<p>This feature also adds the ability for a job to write data entirely outside the destination table, which is done by:</p>
<ol>
<li>Writing new files into the working directory.</li>
<li>Spark moving them to the final destination in job commit.</li>
</ol>
<p>The manifest committer is compatible with dynamic partition overwrites on Azure and Google Cloud Storage as together they meet the core requirements of the extension:</p>
<ol>
<li>The working directory returned in <code>getWorkPath()</code> is in the same filesystem as the final output.</li>
<li><code>rename()</code> is an <code>O(1)</code> operation which is safe and fast to use when committing a job.</li>
</ol>
<p>None of the S3A committers support this. Condition (1) is not met by the staging committers, while (2) is not met by S3 itself.</p>
<p>To use the manifest committer with dynamic partition overwrites, the spark version must contain <a class="externalLink" href="https://issues.apache.org/jira/browse/SPARK-40034">SPARK-40034</a> <i>PathOutputCommitters to work with dynamic partition overwrite</i>.</p>
<p>Be aware that the rename phase of the operation will be slow if many files are renamed; this is done sequentially. Parallel renaming would speed this up, <i>but could trigger the abfs overload problems the manifest committer is designed both to minimize the risk of and to support recovery from</i>.</p>
<p>The spark side of the commit operation will be listing/treewalking the temporary output directory (some overhead), followed by the file promotion, done with a classic filesystem <code>rename()</code> call. There will be no explicit rate limiting here.</p>
<p><i>What does this mean?</i></p>
<p>It means that dynamic partitioning should not be used on Azure Storage for SQL queries/Spark DataSet operations where many thousands of files are created. The fact that these will suffer from performance problems before throttling-scale issues surface should be considered a warning.</p>
<h1><a name="SUCCESS"></a> Job Summaries in <code>_SUCCESS</code> files</h1>
<p>The original Hadoop committer creates a zero-byte <code>_SUCCESS</code> file in the root of the output directory unless disabled.</p>
<p>This committer writes a JSON summary which includes:</p>
<ul>
<li>The name of the committer.</li>
<li>Diagnostics information.</li>
<li>A list of some of the files created (for testing; a full list is excluded as it can get big).</li>
<li>IO Statistics.</li>
</ul>
<p>If, after running a query, this <code>_SUCCESS</code> file is zero bytes long, <i>the new committer has not been used</i>.</p>
<p>If it is not empty, then it can be examined.</p></section><section>
<h2><a name="Viewing__SUCCESS_file_files_through_the_ManifestPrinter_tool."></a><a name="printer"></a> Viewing <code>_SUCCESS</code> files through the <code>ManifestPrinter</code> tool</h2>
<p>The summary files are JSON, and can be viewed in any text editor.</p>
<p>For a more succinct summary, including better display of statistics, use the <code>ManifestPrinter</code> tool.</p>
<div class="source">
<div class="source">
<pre>hadoop org.apache.hadoop.mapreduce.lib.output.committer.manifest.files.ManifestPrinter &lt;path&gt;
</pre></div></div>
<p>This works for the files saved at the base of an output directory, and any reports saved to a report directory.</p></section><section>
<h2><a name="Collecting_Job_Summaries_mapreduce.manifest.committer.summary.report.directory"></a><a name="summaries"></a> Collecting Job Summaries <code>mapreduce.manifest.committer.summary.report.directory</code></h2>
<p>The committer can be configured to save the <code>_SUCCESS</code> summary files to a report directory, irrespective of whether the job succeeded or failed, by setting a filesystem path in the option <code>mapreduce.manifest.committer.summary.report.directory</code>.</p>
<p>The path does not have to be on the same store/filesystem as the destination of work. For example, a local filesystem could be used.</p>
<p>XML</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;mapreduce.manifest.committer.summary.report.directory&lt;/name&gt;
&lt;value&gt;file:///tmp/reports&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>spark-defaults.conf</p>
<div class="source">
<div class="source">
<pre>spark.hadoop.mapreduce.manifest.committer.summary.report.directory file:///tmp/reports
</pre></div></div>
<p>This allows the statistics of jobs to be collected irrespective of their outcome, whether or not saving the <code>_SUCCESS</code> marker is enabled, and without problems caused by a chain of queries overwriting the markers.</p>
<h1><a name="cleanup"></a> Cleanup</h1>
<p>Job cleanup is convoluted as it is designed to address a number of issues which may surface in cloud storage.</p>
<ul>
<li>Slow performance for deletion of directories.</li>
<li>Timeout when deleting very deep and wide directory trees.</li>
<li>General resilience to cleanup issues escalating to job failures.</li>
</ul>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th> Option </th>
<th> Meaning </th>
<th> Default Value </th></tr>
</thead><tbody>
<tr class="b">
<td> <code>mapreduce.fileoutputcommitter.cleanup.skipped</code> </td>
<td> Skip cleanup of <code>_temporary</code> directory</td>
<td> <code>false</code> </td></tr>
<tr class="a">
<td> <code>mapreduce.fileoutputcommitter.cleanup-failures.ignored</code> </td>
<td> Ignore errors during cleanup </td>
<td> <code>false</code> </td></tr>
<tr class="b">
<td> <code>mapreduce.manifest.committer.cleanup.parallel.delete</code> </td>
<td> Delete task attempt directories in parallel </td>
<td> <code>true</code> </td></tr>
</tbody>
</table>
<p>The algorithm is:</p>
<div class="source">
<div class="source">
<pre>if `mapreduce.fileoutputcommitter.cleanup.skipped`:
return
if `mapreduce.manifest.committer.cleanup.parallel.delete`:
attempt parallel delete of task directories; catch any exception
if not `mapreduce.fileoutputcommitter.cleanup.skipped`:
delete(`_temporary`); catch any exception
if caught-exception and not `mapreduce.fileoutputcommitter.cleanup-failures.ignored`:
throw caught-exception
</pre></div></div>
<p>It&#x2019;s a bit complicated, but the goal is to perform a fast/scalable delete and throw a meaningful exception if that didn&#x2019;t work.</p>
<p>When working with ABFS and GCS, these settings should normally be left alone. If errors somehow surface during cleanup, enabling the option to ignore failures will ensure the job still completes. Disabling cleanup avoids its overhead entirely, but requires a workflow or a manual operation to clean up all <code>_temporary</code> directories on a regular basis.</p>
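<p>As a minimal illustration (a Python sketch, not the Hadoop implementation), the control flow above can be modelled with the three options as booleans and the delete operations stubbed out:</p>

```python
# Sketch of the cleanup control flow; names are illustrative, not Hadoop APIs.
def cleanup(skip_cleanup, parallel_delete, ignore_failures,
            delete_task_dirs, delete_temporary):
    """Mirror of the committer's cleanup algorithm described above."""
    if skip_cleanup:            # mapreduce.fileoutputcommitter.cleanup.skipped
        return
    caught = None
    if parallel_delete:         # mapreduce.manifest.committer.cleanup.parallel.delete
        try:
            delete_task_dirs()  # parallel delete of task attempt directories
        except Exception as e:
            caught = e
    try:
        delete_temporary()      # delete the _temporary directory
    except Exception as e:
        caught = e
    if caught is not None and not ignore_failures:
        raise caught            # mapreduce.fileoutputcommitter.cleanup-failures.ignored
```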
<h1><a name="abfs"></a> Working with Azure ADLS Gen2 Storage</h1>
<p>To switch to the manifest committer, the factory for committers for destinations with <code>abfs://</code> URLs must be switched to the manifest committer factory, either for the application or the entire cluster.</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;mapreduce.outputcommitter.factory.scheme.abfs&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>This allows for ADLS Gen2-specific performance and consistency logic to be used from within the committer. In particular:</p>
<ul>
<li>The <a class="externalLink" href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag">Etag</a> header can be collected in listings and used in the job commit phase.</li>
<li>IO rename operations are rate limited.</li>
<li>Recovery is attempted when throttling triggers rename failures.</li>
</ul>
<p><i>Warning</i> This committer is not compatible with older Azure storage services (WASB or ADLS Gen 1).</p>
<p>The core set of Azure-optimized options becomes</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;mapreduce.outputcommitter.factory.scheme.abfs&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;spark.hadoop.fs.azure.io.rate.limit&lt;/name&gt;
&lt;value&gt;10000&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>And optional settings for debugging/performance analysis</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;mapreduce.manifest.committer.summary.report.directory&lt;/name&gt;
&lt;value&gt;abfs:// Path within same store/separate store&lt;/value&gt;
&lt;description&gt;Optional: path to where job summaries are saved&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
</section><section>
<h2><a name="Full_set_of_ABFS_options_for_spark"></a><a name="abfs-options"></a> Full set of ABFS options for spark</h2>
<div class="source">
<div class="source">
<pre>spark.hadoop.mapreduce.outputcommitter.factory.scheme.abfs org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory
spark.hadoop.fs.azure.io.rate.limit 10000
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.hadoop.mapreduce.manifest.committer.summary.report.directory (optional: URI of a directory for job summaries)
</pre></div></div>
</section><section>
<h2><a name="Experimental:_ABFS_Rename_Rate_Limiting_fs.azure.io.rate.limit"></a>Experimental: ABFS Rename Rate Limiting <code>fs.azure.io.rate.limit</code></h2>
<p>To avoid triggering store throttling and backoff delays, as well as other throttling-related failure conditions, file renames during job commit are throttled through a &#x201c;rate limiter&#x201d; which limits the number of rename operations per second which a single instance of the ABFS FileSystem client may issue.</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th> Option </th>
<th> Meaning </th></tr>
</thead><tbody>
<tr class="b">
<td> <code>fs.azure.io.rate.limit</code> </td>
<td> Rate limit in operations/second for IO operations. </td></tr>
</tbody>
</table>
<p>Set the option to <code>0</code> to remove all rate limiting.</p>
<p>The default value is 10000, which is the default IO capacity of an ADLS storage account.</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;fs.azure.io.rate.limit&lt;/name&gt;
&lt;value&gt;10000&lt;/value&gt;
&lt;description&gt;maximum number of renames attempted per second&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
<p>This capacity is set at the level of the filesystem client, and so is not shared across all processes within a single application, let alone other applications sharing the same storage account.</p>
<p>It will be shared with all jobs being committed by the same Spark driver, as these do share that filesystem connector.</p>
<p>If rate limiting is imposed, the statistic <code>store_io_rate_limited</code> will report the time to acquire permits for committing files.</p>
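<p>Conceptually (a Python sketch, not the ABFS implementation), the limiter hands out one permit per rename at a fixed rate, and the time spent waiting for a permit is what <code>store_io_rate_limited</code> measures:</p>

```python
import time

# Illustrative permit-per-operation rate limiter; the class and method
# names are not Hadoop APIs.
class RenameRateLimiter:
    def __init__(self, ops_per_second):
        # A rate of 0 disables limiting, matching fs.azure.io.rate.limit = 0.
        self.interval = (1.0 / ops_per_second) if ops_per_second > 0 else 0.0
        self.next_free = time.monotonic()

    def acquire(self):
        """Block until a permit is available; return seconds spent waiting."""
        now = time.monotonic()
        wait = max(0.0, self.next_free - now)
        if wait:
            time.sleep(wait)
        self.next_free = max(now, self.next_free) + self.interval
        return wait
```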
<p>If server-side throttling took place, signs of this can be seen in:</p>
<ul>
<li>The store service&#x2019;s logs and their throttling status codes (usually 503 or 500).</li>
<li>The job statistic <code>commit_file_rename_recovered</code>. This statistic indicates that ADLS throttling manifested as failures in renames, failures which were recovered from in the committer.</li>
</ul>
<p>If these are seen, or if other applications running at the same time experience throttling or throttling-triggered problems, consider reducing the value of <code>fs.azure.io.rate.limit</code>, and/or requesting a higher IO capacity from Microsoft.</p>
<p><i>Important</i> if you do get extra capacity from Microsoft and you want to use it to speed up job commits, increase the value of <code>fs.azure.io.rate.limit</code> either across the cluster, or specifically for those jobs which you wish to allocate extra priority to.</p>
<p>This is still a work in progress; it may be expanded to support all IO operations performed by a single filesystem instance.</p>
<h1><a name="gcs"></a> Working with Google Cloud Storage</h1>
<p>The manifest committer is compatible with and tested against Google Cloud Storage through the gcs-connector library from Google, which provides a Hadoop filesystem client for the scheme <code>gs</code>.</p>
<p>Google cloud storage has the semantics needed for the commit protocol to work safely.</p>
<p>The Spark settings to switch to this committer are</p>
<div class="source">
<div class="source">
<pre>spark.hadoop.mapreduce.outputcommitter.factory.scheme.gs org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.hadoop.mapreduce.manifest.committer.summary.report.directory (optional: URI of a directory for job summaries)
</pre></div></div>
<p>The store&#x2019;s directory delete operations are <code>O(files)</code> so the value of <code>mapreduce.manifest.committer.cleanup.parallel.delete</code> SHOULD be left at the default of <code>true</code>.</p>
<p>For MapReduce, declare the binding in <code>core-site.xml</code> or <code>mapred-site.xml</code>:</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
  &lt;name&gt;mapreduce.outputcommitter.factory.scheme.gs&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<h1><a name="hdfs"></a> Working with HDFS</h1>
<p>This committer <i>does</i> work with HDFS; it has simply been targeted at object stores with reduced performance on some operations, especially listing and renaming, and with semantics too reduced for the classic <code>FileOutputCommitter</code> to rely on (specifically GCS).</p>
<p>To use on HDFS, set the <code>ManifestCommitterFactory</code> as the committer factory for <code>hdfs://</code> URLs.</p>
<p>Because HDFS does fast directory deletion, there is no need to parallelize deletion of task attempt directories during cleanup, so set <code>mapreduce.manifest.committer.cleanup.parallel.delete</code> to <code>false</code>.</p>
<p>The final Spark bindings become:</p>
<div class="source">
<div class="source">
<pre>spark.hadoop.mapreduce.outputcommitter.factory.scheme.hdfs org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory
spark.hadoop.mapreduce.manifest.committer.cleanup.parallel.delete false
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.hadoop.mapreduce.manifest.committer.summary.report.directory (optional: URI of a directory for job summaries)
</pre></div></div>
<h1><a name="advanced"></a> Advanced Topics</h1></section><section>
<h2><a name="Advanced_Configuration_options"></a>Advanced Configuration options</h2>
<p>There are some advanced options which are intended for development and testing, rather than production use.</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th> Option </th>
<th> Meaning </th>
<th> Default Value </th></tr>
</thead><tbody>
<tr class="b">
<td> <code>mapreduce.manifest.committer.store.operations.classname</code> </td>
<td> Classname for Manifest Store Operations </td>
<td> <code>&quot;&quot;</code> </td></tr>
<tr class="a">
<td> <code>mapreduce.manifest.committer.validate.output</code> </td>
<td> Perform output validation? </td>
<td> <code>false</code> </td></tr>
</tbody>
</table></section><section>
<h2><a name="Validating_output__mapreduce.manifest.committer.validate.output"></a>Validating output <code>mapreduce.manifest.committer.validate.output</code></h2>
<p>The option <code>mapreduce.manifest.committer.validate.output</code> triggers a check of every renamed file to verify it has the expected length.</p>
<p>This adds the overhead of a <code>HEAD</code> request per file, and so is recommended for testing only.</p>
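<p>To enable validation during a test run:</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
  &lt;name&gt;mapreduce.manifest.committer.validate.output&lt;/name&gt;
  &lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>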
<p>There is no verification of the actual contents.</p></section><section>
<h2><a name="Controlling_storage_integration_mapreduce.manifest.committer.store.operations.classname"></a>Controlling storage integration <code>mapreduce.manifest.committer.store.operations.classname</code></h2>
<p>The manifest committer interacts with filesystems through implementations of the interface <code>ManifestStoreOperations</code>. It is possible to provide custom implementations for store-specific features. There is one of these for ABFS; when the abfs-specific committer factory is used this is automatically set.</p>
<p>It can be explicitly set.</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;mapreduce.manifest.committer.store.operations.classname&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.fs.azurebfs.commit.AbfsManifestStoreOperations&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>The default implementation may also be configured.</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;mapreduce.manifest.committer.store.operations.classname&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.mapreduce.lib.output.committer.manifest.impl.ManifestStoreOperationsThroughFileSystem&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>There is no need to alter these values, except when writing new implementations for other stores, something which is only needed if the store provides extra integration support for the committer.</p></section><section>
<h2><a name="Support_for_concurrent_jobs_to_the_same_directory"></a><a name="concurrent"></a> Support for concurrent jobs to the same directory</h2>
<p>It <i>may</i> be possible to run multiple jobs targeting the same directory tree.</p>
<p>For this to work, a number of conditions must be met:</p>
<ul>
<li>When using Spark, unique job IDs must be set. This means the Spark distribution MUST contain the patches for <a class="externalLink" href="https://issues.apache.org/jira/browse/SPARK-33402">SPARK-33402</a> and <a class="externalLink" href="https://issues.apache.org/jira/browse/SPARK-33230">SPARK-33230</a>.</li>
<li>Cleanup of the <code>_temporary</code> directory must be disabled by setting <code>mapreduce.fileoutputcommitter.cleanup.skipped</code> to <code>true</code>.</li>
<li>All jobs/tasks must create files with unique filenames.</li>
<li>All jobs must create output with the same directory partition structure.</li>
<li>The job/queries MUST NOT be using Spark Dynamic Partitioning &#x201c;INSERT OVERWRITE TABLE&#x201d;; data may be lost. This holds for <i>all</i> committers, not just the manifest committer.</li>
<li>Remember to delete the <code>_temporary</code> directory later!</li>
</ul>
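<p>A possible <code>spark-defaults.conf</code> fragment for such a setup, combining the committer and cleanup settings described above:</p>
<div class="source">
<div class="source">
<pre>spark.hadoop.mapreduce.outputcommitter.factory.scheme.abfs org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
</pre></div></div>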
<p>This has <i>NOT BEEN TESTED</i>.</p></section>
</div>
</div>
<div class="clear">
<hr/>
</div>
<div id="footer">
<div class="xright">
&#169; 2008-2023
Apache Software Foundation
- <a href="http://maven.apache.org/privacy-policy.html">Privacy Policy</a>.
Apache Maven, Maven, Apache, the Apache feather logo, and the Apache Maven project logos are trademarks of The Apache Software Foundation.
</div>
<div class="clear">
<hr/>
</div>
</div>
</body>
</html>