<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--
| Generated by Apache Maven Doxia at 2023-03-02
| Rendered using Apache Maven Stylus Skin 1.5
-->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Apache Hadoop Distributed Copy &#x2013; DistCp Guide</title>
<style type="text/css" media="all">
@import url("./css/maven-base.css");
@import url("./css/maven-theme.css");
@import url("./css/site.css");
</style>
<link rel="stylesheet" href="./css/print.css" type="text/css" media="print" />
<meta name="Date-Revision-yyyymmdd" content="20230302" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body class="composite">
<div id="banner">
<a href="http://hadoop.apache.org/" id="bannerLeft">
<img src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="" />
</a>
<a href="http://www.apache.org/" id="bannerRight">
<img src="http://www.apache.org/images/asf_logo_wide.png" alt="" />
</a>
<div class="clear">
<hr/>
</div>
</div>
<div id="breadcrumbs">
<div class="xright"> <a href="http://wiki.apache.org/hadoop" class="externalLink">Wiki</a>
|
<a href="https://gitbox.apache.org/repos/asf/hadoop.git" class="externalLink">git</a>
&nbsp;| Last Published: 2023-03-02
&nbsp;| Version: 3.4.0-SNAPSHOT
</div>
<div class="clear">
<hr/>
</div>
</div>
<div id="bodyColumn">
<div id="contentBox">
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<h1>DistCp Guide</h1><hr />
<ul>
<li><a href="#Overview">Overview</a></li>
<li><a href="#Usage">Usage</a>
<ul>
<li><a href="#Basic_Usage">Basic Usage</a></li>
<li><a href="#Update_and_Overwrite">Update and Overwrite</a></li>
<li><a href="#Sync">Sync</a></li>
</ul>
</li>
<li><a href="#Command_Line_Options">Command Line Options</a></li>
<li><a href="#Architecture_of_DistCp">Architecture of DistCp</a>
<ul>
<li><a href="#DistCp_Driver">DistCp Driver</a></li>
<li><a href="#Copy-listing_Generator">Copy-listing Generator</a></li>
<li><a href="#InputFormats_and_MapReduce_Components">InputFormats and MapReduce Components</a></li>
</ul>
</li>
<li><a href="#Appendix">Appendix</a>
<ul>
<li><a href="#Map_sizing">Map sizing</a></li>
<li><a href="#Copying_Between_Versions_of_HDFS">Copying Between Versions of HDFS</a></li>
<li><a href="#MapReduce_and_other_side-effects">MapReduce and other side-effects</a></li>
</ul>
</li>
<li><a href="#Frequently_Asked_Questions">Frequently Asked Questions</a></li>
</ul><hr /><section>
<h2><a name="Overview"></a>Overview</h2>
<p>DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.</p>
<p><a class="externalLink" href="http://hadoop.apache.org/docs/r1.2.1/distcp.html">The erstwhile implementation of DistCp</a> has its share of quirks and drawbacks, both in its usage and in its extensibility and performance. The purpose of the DistCp refactor was to fix these shortcomings, enabling it to be used and extended programmatically. New paradigms have been introduced to improve runtime and setup performance, while retaining the legacy behaviour as the default.</p>
<p>This document describes the design of the new DistCp, its new features, their optimal use, and any deviation from the legacy implementation.</p></section><section>
<h2><a name="Usage"></a>Usage</h2><section>
<h3><a name="Basic_Usage"></a>Basic Usage</h3>
<p>The most common invocation of DistCp is an inter-cluster copy:</p>
<div class="source">
<div class="source">
<pre>bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
hdfs://nn2:8020/bar/foo
</pre></div></div>
<p>This will expand the namespace under <code>/foo/bar</code> on <code>nn1</code> into a temporary file, partition its contents among a set of map tasks, and start a copy on each NodeManager from <code>nn1</code> to <code>nn2</code>.</p>
<p>One can also specify multiple source directories on the command line:</p>
<div class="source">
<div class="source">
<pre>bash$ hadoop distcp hdfs://nn1:8020/foo/a \
hdfs://nn1:8020/foo/b \
hdfs://nn2:8020/bar/foo
</pre></div></div>
<p>Or, equivalently, from a file using the -f option:</p>
<div class="source">
<div class="source">
<pre>bash$ hadoop distcp -f hdfs://nn1:8020/srclist \
hdfs://nn2:8020/bar/foo
</pre></div></div>
<p>Where <code>srclist</code> contains</p>
<div class="source">
<div class="source">
<pre>hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b
</pre></div></div>
<p>When copying from multiple sources, DistCp will abort the copy with an error message if two sources collide, but collisions at the destination are resolved per the <a href="#Command_Line_Options">options</a> specified. By default, files already existing at the destination are skipped (i.e. not replaced by the source file). A count of skipped files is reported at the end of each job, though it may be inaccurate if a copier failed for some subset of its files and then succeeded on a later attempt.</p>
<p>It is important that each NodeManager can reach and communicate with both the source and destination file systems. For HDFS, both the source and destination must be running the same version of the protocol or use a backwards-compatible protocol; see <a href="#Copying_Between_Versions_of_HDFS">Copying Between Versions of HDFS</a>.</p>
<p>After a copy, it is recommended that one generates and cross-checks a listing of the source and destination to verify that the copy was truly successful. Since DistCp relies on both MapReduce and the FileSystem API, issues in or between any of these three components (DistCp itself, MapReduce, and the FileSystem API) could adversely and silently affect the copy. Some have had success running with <code>-update</code> enabled to perform a second pass, but users should be acquainted with its semantics before attempting this.</p>
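<p>The cross-check suggested above could be sketched with a small shell helper. This is only an illustration and not part of DistCp: the function name <code>normalize_listing</code> is made up for this example, it assumes the usual eight-column output of <code>hdfs dfs -ls -R</code> and paths without whitespace, and it compares file sizes only, not contents.</p>
<div class="source">
<div class="source">
<pre>
```shell
# Reduce `hdfs dfs -ls -R` output (read from stdin) to sorted
# "size relative-path" lines for the files under ROOT.
# Assumed column layout: perms repl owner group size date time path.
# ROOT must be written the way it appears in the listing
# (for example /foo/bar or hdfs://nn1:8020/foo/bar).
normalize_listing() {
  root=${1%/}
  awk -v root="$root" '
    # Only full 8-column entries are considered.
    NF == 8 {
      if (substr($1, 1, 1) == "d") next       # skip directories
      if (index($8, root "/") != 1) next      # skip paths outside ROOT
      print $5, substr($8, length(root) + 2)  # size and path relative to ROOT
    }
  ' | sort
}
```
</pre></div></div>
<p>Piping the recursive listing of each cluster through <code>normalize_listing</code> with the matching root, and then comparing the two results with <code>diff</code>, would surface missing files and size mismatches. A checksum comparison would still be needed to confirm that contents match.</p>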
<p>It&#x2019;s also worth noting that if another client is still writing to a source file, the copy will likely fail. Attempting to overwrite a file being written at the destination should also fail on HDFS. If a source file is (re)moved before it is copied, the copy will fail with a <code>FileNotFoundException</code>.</p>
<p>Please refer to the detailed <a href="#Command_Line_Options">Command Line Options</a> reference for information on all the options available in DistCp.</p></section><section>
<h3><a name="Update_and_Overwrite"></a>Update and Overwrite</h3>
<p><code>-update</code> is used to copy files from source that don&#x2019;t exist at the target or differ from the target version. <code>-overwrite</code> overwrites target-files that exist at the target.</p>
<p>The Update and Overwrite options warrant special attention since their handling of source-paths varies from the defaults in a very subtle manner. Consider a copy from <code>/source/first/</code> and <code>/source/second/</code> to <code>/target/</code>, where the source paths have the following contents:</p>
<div class="source">
<div class="source">
<pre>hdfs://nn1:8020/source/first/1
hdfs://nn1:8020/source/first/2
hdfs://nn1:8020/source/second/10
hdfs://nn1:8020/source/second/20
</pre></div></div>
<p>When DistCp is invoked without <code>-update</code> or <code>-overwrite</code>, the DistCp defaults would create directories <code>first/</code> and <code>second/</code>, under <code>/target</code>. Thus:</p>
<div class="source">
<div class="source">
<pre>distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
</pre></div></div>
<p>would yield the following contents in <code>/target</code>:</p>
<div class="source">
<div class="source">
<pre>hdfs://nn2:8020/target/first/1
hdfs://nn2:8020/target/first/2
hdfs://nn2:8020/target/second/10
hdfs://nn2:8020/target/second/20
</pre></div></div>
<p>When either <code>-update</code> or <code>-overwrite</code> is specified, the <b>contents</b> of the source-directories are copied to target, and not the source directories themselves. Thus:</p>
<div class="source">
<div class="source">
<pre>distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
</pre></div></div>
<p>would yield the following contents in <code>/target</code>:</p>
<div class="source">
<div class="source">
<pre>hdfs://nn2:8020/target/1
hdfs://nn2:8020/target/2
hdfs://nn2:8020/target/10
hdfs://nn2:8020/target/20
</pre></div></div>
<p>By extension, if both source folders contained a file with the same name (say, <code>0</code>), then both sources would map an entry to <code>/target/0</code> at the destination. Rather than permit this conflict, DistCp will abort.</p>
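<p>The path mapping described above can be modelled with a few lines of shell. This is a simplified sketch rather than DistCp&#x2019;s actual code: the function name <code>map_target</code> and the mode names are hypothetical, and the target directory is assumed to exist already.</p>
<div class="source">
<div class="source">
<pre>
```shell
# Print where FILE (a path under SRC_DIR) would land under TARGET.
# MODE is "default", "update" or "overwrite" (names chosen for this sketch).
map_target() {
  mode=$1
  src_dir=${2%/}
  file=$3
  target=${4%/}
  rel=${file#"$src_dir"/}   # path of FILE relative to SRC_DIR
  case $mode in
    update|overwrite)
      # The contents of SRC_DIR are copied directly into TARGET.
      printf '%s/%s\n' "$target" "$rel" ;;
    *)
      # SRC_DIR itself is re-created under TARGET.
      printf '%s/%s/%s\n' "$target" "${src_dir##*/}" "$rel" ;;
  esac
}
```
</pre></div></div>
<p>With mode <code>update</code>, a file <code>0</code> under <code>/source/first</code> and a file <code>0</code> under <code>/source/second</code> both map to <code>/target/0</code>, which is the collision that makes DistCp abort.</p>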
<p>Now, consider the following copy operation, invoked with either <code>-update</code> or <code>-overwrite</code> added:</p>
<div class="source">
<div class="source">
<pre>distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
</pre></div></div>
<p>With sources/sizes:</p>
<div class="source">
<div class="source">
<pre>hdfs://nn1:8020/source/first/1 32
hdfs://nn1:8020/source/first/2 32
hdfs://nn1:8020/source/second/10 64
hdfs://nn1:8020/source/second/20 32
</pre></div></div>
<p>And destination/sizes:</p>
<div class="source">
<div class="source">
<pre>hdfs://nn2:8020/target/1 32
hdfs://nn2:8020/target/10 32
hdfs://nn2:8020/target/20 64
</pre></div></div>
<p>The result will be:</p>
<div class="source">
<div class="source">
<pre>hdfs://nn2:8020/target/1 32
hdfs://nn2:8020/target/2 32
hdfs://nn2:8020/target/10 64
hdfs://nn2:8020/target/20 32
</pre></div></div>
<p>If <code>-update</code> is used, <code>1</code> is skipped because the file length and contents match. <code>2</code> is copied because it doesn&#x2019;t exist at the target. <code>10</code> and <code>20</code> are overwritten since their contents don&#x2019;t match the source. However, if <code>-append</code> is additionally used, then only <code>20</code> is overwritten (source length less than destination) and <code>10</code> is appended with the change in the file (if the files match up to the destination&#x2019;s original length).</p>
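<p>The per-file decisions in this example can be approximated by the sketch below. It is a deliberate simplification and not DistCp&#x2019;s implementation: the function name <code>copy_action</code> and the mode names are hypothetical, equal lengths are treated as equal contents (the real comparison also involves checksums), and <code>-append</code> is left out.</p>
<div class="source">
<div class="source">
<pre>
```shell
# Decide what happens to one file.
# MODE is "default", "update" or "overwrite" (names chosen for this sketch).
# DST_LEN is "-" when the file does not exist at the target.
# Simplification: equal lengths stand in for equal contents.
copy_action() {
  mode=$1
  src_len=$2
  dst_len=$3
  if [ "$dst_len" = "-" ]; then
    echo copy            # missing at the target: always copied
  elif [ "$mode" = "overwrite" ]; then
    echo overwrite       # -overwrite replaces every existing file
  elif [ "$mode" = "update" ]; then
    if [ "$src_len" -eq "$dst_len" ]; then
      echo skip          # same length (and, by assumption, same contents)
    else
      echo overwrite     # differs from the source
    fi
  else
    echo skip            # default: existing files are left alone
  fi
}
```
</pre></div></div>
<p>For the sizes listed above, <code>copy_action update 32 32</code> prints <code>skip</code> for file <code>1</code>, <code>copy_action update 32 -</code> prints <code>copy</code> for file <code>2</code>, and <code>copy_action update 64 32</code> prints <code>overwrite</code> for file <code>10</code>.</p>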
<p>If <code>-overwrite</code> is used, <code>1</code> is overwritten as well.</p></section><section>
<h3><a name="Sync"></a>Sync</h3>
<p>The <code>-diff</code> option syncs files from a source cluster to a target cluster using a snapshot diff. It copies, renames and removes files according to the snapshot diff list.</p>
<p>The <code>-update</code> option must be included when the <code>-diff</code> option is in use.</p>
<p>Most cloud providers don&#x2019;t work well with sync at the moment.</p>
<p>Usage:</p>
<div class="source">
<div class="source">
<pre>hadoop distcp -update -diff &lt;from_snapshot&gt; &lt;to_snapshot&gt; &lt;source&gt; &lt;destination&gt;
</pre></div></div>
<p>Example:</p>
<div class="source">
<div class="source">
<pre>hadoop distcp -update -diff snap1 snap2 /src/ /dst/
</pre></div></div>
<p>The command above applies the changes from snapshot <code>snap1</code> to <code>snap2</code> (i.e. the snapshot diff from <code>snap1</code> to <code>snap2</code>) in <code>/src/</code> to <code>/dst/</code>. This requires <code>/src/</code> to have both snapshots <code>snap1</code> and <code>snap2</code>. The destination <code>/dst/</code> must also have a snapshot with the same name as <code>&lt;from_snapshot&gt;</code>, in this case <code>snap1</code>, and should not have had any new file operations (create, rename, delete) since <code>snap1</code>. Note that when this command finishes, a new snapshot <code>snap2</code> will NOT be created at <code>/dst/</code>.</p>
<p>For instance, in <code>/src/</code>, if <code>1.txt</code> is added and <code>2.txt</code> is deleted after the creation of <code>snap1</code> and before creation of <code>snap2</code>, the command above will copy <code>1.txt</code> from <code>/src/</code> to <code>/dst/</code> and delete <code>2.txt</code> from <code>/dst/</code>.</p>
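<p>To preview what a sync is likely to do, one could inspect the snapshot diff report before running DistCp. The helper below is a hedged sketch, not part of DistCp: the name <code>sync_actions</code> is made up, it assumes the report layout printed by <code>hdfs snapshotDiff</code> (a one-letter change type followed by a path) and paths without whitespace, and it only labels the entries rather than reproducing DistCp&#x2019;s exact processing order.</p>
<div class="source">
<div class="source">
<pre>
```shell
# Translate a snapshot diff report (read from stdin) into the kind of action
# a sync would be expected to apply at the destination.
# Assumed entry types: + (created), - (deleted), M (modified), R (renamed).
# The report header line matches none of the types and is ignored.
sync_actions() {
  awk '
    $1 == "+" { print "copy", $2 }
    $1 == "-" { print "delete", $2 }
    $1 == "M" { print "modified", $2 }
    $1 == "R" { print "rename", $2, "to", $4 }   # report line: R old, arrow, new
  '
}
```
</pre></div></div>
<p>For the example above, piping the output of <code>hdfs snapshotDiff /src/ snap1 snap2</code> through <code>sync_actions</code> would be expected to list a copy of <code>1.txt</code> and a delete of <code>2.txt</code>.</p>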
<p>Sync behavior will be elaborated using experiments below.</p><section>
<h4><a name="Experiment_1:_Syncing_diff_of_two_adjacent_snapshots"></a>Experiment 1: Syncing diff of two adjacent snapshots</h4>
<p>Some preparations before we start.</p>
<div class="source">
<div class="source">
<pre># Create source and destination directories
hdfs dfs -mkdir /src/ /dst/
# Allow snapshot on source
hdfs dfsadmin -allowSnapshot /src/
# Create a snapshot (empty one)
hdfs dfs -createSnapshot /src/ snap1
# Allow snapshot on destination
hdfs dfsadmin -allowSnapshot /dst/
# Create a from_snapshot with the same name
hdfs dfs -createSnapshot /dst/ snap1
# Put one text file under /src/
echo &quot;This is the 1st text file.&quot; &gt; 1.txt
hdfs dfs -put 1.txt /src/
# Create the second snapshot
hdfs dfs -createSnapshot /src/ snap2
# Put another text file under /src/
echo &quot;This is the 2nd text file.&quot; &gt; 2.txt
hdfs dfs -put 2.txt /src/
# Create the third snapshot
hdfs dfs -createSnapshot /src/ snap3
</pre></div></div>
<p>Then we run distcp sync:</p>
<div class="source">
<div class="source">
<pre>hadoop distcp -update -diff snap1 snap2 /src/ /dst/
</pre></div></div>
<p>The command above should succeed. <code>1.txt</code> will be copied from <code>/src/</code> to <code>/dst/</code>. Again, the <code>-update</code> option is required.</p>
<p>If we run the same command again, we will get a <code>DistCp sync failed</code> exception because the destination has gained a new file, <code>1.txt</code>, since <code>snap1</code>. That being said, if we remove <code>1.txt</code> manually from <code>/dst/</code> and run the sync again, the command will succeed.</p></section><section>
<h4><a name="Experiment_2:_syncing_diff_of_two_non-adjacent_snapshots"></a>Experiment 2: Syncing diff of two non-adjacent snapshots</h4>
<p>First do a cleanup from Experiment 1.</p>
<div class="source">
<div class="source">
<pre>hdfs dfs -rm -skipTrash /dst/1.txt
</pre></div></div>
<p>Run the sync command. Note that <code>&lt;to_snapshot&gt;</code> has been changed from <code>snap2</code> in Experiment 1 to <code>snap3</code>.</p>
<div class="source">
<div class="source">
<pre>hadoop distcp -update -diff snap1 snap3 /src/ /dst/
</pre></div></div>
<p>Both <code>1.txt</code> and <code>2.txt</code> will be copied to <code>/dst/</code>.</p></section><section>
<h4><a name="Experiment_3:_syncing_file_delete_operation"></a>Experiment 3: Syncing file delete operation</h4>
<p>Continuing from the end of Experiment 2:</p>
<div class="source">
<div class="source">
<pre>hdfs dfs -rm -skipTrash /dst/2.txt
# Create snap2 at destination, it contains 1.txt
hdfs dfs -createSnapshot /dst/ snap2
# Delete 1.txt from source
hdfs dfs -rm -skipTrash /src/1.txt
# Create snap4 at source, it only contains 2.txt
hdfs dfs -createSnapshot /src/ snap4
</pre></div></div>
<p>Run the sync command now:</p>
<div class="source">
<div class="source">
<pre>hadoop distcp -update -diff snap2 snap4 /src/ /dst/
</pre></div></div>
<p><code>2.txt</code> is copied and <code>1.txt</code> is deleted under <code>/dst/</code>.</p>
<p>Note that, though both <code>/src/</code> and <code>/dst/</code> have snapshot with the same name <code>snap2</code>, the snapshots don&#x2019;t need to have the same content. That means, if you have a <code>1.txt</code> in <code>/dst/</code>&#x2019;s <code>snap2</code> but they have different contents, <code>1.txt</code> will still be removed from <code>/dst/</code>. The sync command doesn&#x2019;t check the contents of the files that is going to be deleted. It simply follows the snapshot diff list between <code>&lt;from_snapshot&gt;</code> and &lt;to_snapshot&gt;.</p>
<p>Also, if we delete <code>1.txt</code> from <code>/dst/</code> before creating <code>snap2</code> on <code>/dst/</code> in the steps above, so that <code>/dst/</code>&#x2019;s <code>snap2</code> doesn&#x2019;t have <code>1.txt</code> before running the sync command, the command will still succeed. It won&#x2019;t throw an exception when it tries to delete a <code>1.txt</code> that doesn&#x2019;t exist in <code>/dst/</code>.</p></section></section><section>
<h3><a name="raw_Namespace_Extended_Attribute_Preservation"></a>raw Namespace Extended Attribute Preservation</h3>
<p>This section only applies to HDFS.</p>
<p>If the target and all of the source pathnames are in the <code>/.reserved/raw</code> hierarchy, then &#x2018;raw&#x2019; namespace extended attributes will be preserved. &#x2018;raw&#x2019; xattrs are used by the system for internal functions such as encryption metadata. They are only visible to users when accessed through the <code>/.reserved/raw</code> hierarchy.</p>
<p>raw xattrs are preserved based solely on whether <code>/.reserved/raw</code> prefixes are supplied. The <code>-p</code> (preserve, see below) flag does not impact preservation of raw xattrs.</p>
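<p>As an illustration only (the NameNode addresses and the <code>/data/zone</code> path below are placeholders, not taken from this guide), a copy that preserves raw xattrs supplies the prefix on both the source and the target:</p>
<div class="source">
<div class="source">
<pre># Both paths carry the /.reserved/raw prefix, so raw xattrs are preserved
hadoop distcp \
    hdfs://nn1:8020/.reserved/raw/data/zone \
    hdfs://nn2:8020/.reserved/raw/data/zone
</pre></div></div>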
<p>To prevent raw xattrs from being preserved, simply do not use the <code>/.reserved/raw</code> prefix on any of the source and target paths.</p>
<p>If the <code>/.reserved/raw</code> prefix is specified on only a subset of the source and target paths, an error will be displayed and a non-zero exit code returned.</p></section></section><section>
<h2><a name="Command_Line_Options"></a>Command Line Options</h2>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th> Flag </th>
<th> Description </th>
<th> Notes </th></tr>
</thead><tbody>
<tr class="b">
<td> <code>-p[rbugpcaxte]</code> </td>
<td> Preserve r: replication number b: block size u: user g: group p: permission c: checksum-type a: ACL x: XAttr t: timestamp e: erasure coding policy </td>
<td> When <code>-update</code> is specified, status updates will <b>not</b> be synchronized unless the file sizes also differ (i.e. unless the file is re-created). If <code>-pa</code> is specified, DistCp also preserves the permissions, because ACLs are a super-set of permissions. The option <code>-pr</code> is only valid if neither the source nor the target directory is erasure coded. </td></tr>
<tr class="a">
<td> <code>-i</code> </td>
<td> Ignore failures </td>
<td> As explained in the Appendix, this option will keep more accurate statistics about the copy than the default case. It also preserves logs from failed copies, which can be valuable for debugging. Finally, a failing map will not cause the job to fail before all splits are attempted. </td></tr>
<tr class="b">
<td> <code>-log &lt;logdir&gt;</code> </td>
<td> Write logs to &lt;logdir&gt; </td>
<td> DistCp keeps logs of each file it attempts to copy as map output. If a map fails, the log output will not be retained if it is re-executed. </td></tr>
<tr class="a">
<td> <code>-v</code> </td>
<td> Log additional info (path, size) in the SKIP/COPY log </td>
<td> This option can only be used with the <code>-log</code> option. </td></tr>
<tr class="b">
<td> <code>-m &lt;num_maps&gt;</code> </td>
<td> Maximum number of simultaneous copies </td>
<td> Specify the number of maps to copy data. Note that more maps may not necessarily improve throughput. </td></tr>
<tr class="a">
<td> <code>-overwrite</code> </td>
<td> Overwrite destination </td>
<td> If a map fails and <code>-i</code> is not specified, all the files in the split, not only those that failed, will be recopied. As discussed in the Usage documentation, it also changes the semantics for generating destination paths, so users should use this carefully. </td></tr>
<tr class="b">
<td> <code>-update</code> </td>
<td> Overwrite if source and destination differ in size, blocksize, or checksum </td>
<td> As noted earlier, this is not a &#x201c;sync&#x201d; operation. The criteria examined are the source and destination file sizes, blocksizes, and checksums; if they differ, the source file replaces the destination file. As discussed in the Usage documentation, it also changes the semantics for generating destination paths, so users should use this carefully. </td></tr>
<tr class="a">
<td> <code>-append</code> </td>
<td> Incremental copy of file with same name but different length </td>
<td> If the source file is greater in length than the destination file, the checksum of the common length part is compared. If the checksum matches, only the difference is copied using read and append functionalities. The <code>-append</code> option only works with <code>-update</code> and without <code>-skipcrccheck</code>. </td></tr>
<tr class="b">
<td> <code>-f &lt;urilist_uri&gt;</code> </td>
<td> Use list at &lt;urilist_uri&gt; as src list </td>
<td> This is equivalent to listing each source on the command line. The <code>urilist_uri</code> list should be a fully qualified URI. </td></tr>
<tr class="a">
<td> <code>-filters</code> </td>
<td> The path to a file containing a list of pattern strings, one string per line, such that paths matching the pattern will be excluded from the copy. </td>
<td> Supports regular expressions as specified by <code>java.util.regex.Pattern</code>. </td></tr>
<tr class="b">
<td> <code>-filelimit &lt;n&gt;</code> </td>
<td> Limit the total number of files to be &lt;= n </td>
<td> <b>Deprecated!</b> Ignored in the new DistCp. </td></tr>
<tr class="a">
<td> <code>-sizelimit &lt;n&gt;</code> </td>
<td> Limit the total size to be &lt;= n bytes </td>
<td> <b>Deprecated!</b> Ignored in the new DistCp. </td></tr>
<tr class="b">
<td> <code>-delete</code> </td>
<td> Delete the files existing in the dst but not in src </td>
<td> The deletion is done by FS Shell, so the trash will be used if it is enabled. <code>-delete</code> is applicable only with the <code>-update</code> or <code>-overwrite</code> options. </td></tr>
<tr class="a">
<td> <code>-strategy {dynamic|uniformsize}</code> </td>
<td> Choose the copy-strategy to be used in DistCp. </td>
<td> By default, uniformsize is used. (i.e. Maps are balanced on the total size of files copied by each map. Similar to legacy.) If &#x201c;dynamic&#x201d; is specified, <code>DynamicInputFormat</code> is used instead. (This is described in the Architecture section, under InputFormats.) </td></tr>
<tr class="b">
<td> <code>-bandwidth</code> </td>
<td> Specify bandwidth per map, in MB/second. </td>
<td> Each map will be restricted to consume only the specified bandwidth. This is not always exact. The map throttles back its bandwidth consumption during a copy, such that the <b>net</b> bandwidth used tends towards the specified value. </td></tr>
<tr class="a">
<td> <code>-atomic {-tmp &lt;tmp_dir&gt;}</code> </td>
<td> Specify atomic commit, with optional tmp directory. </td>
<td> <code>-atomic</code> instructs DistCp to copy the source data to a temporary target location, and then move the temporary target to the final-location atomically. Data will either be available at final target in a complete and consistent form, or not at all. Optionally, <code>-tmp</code> may be used to specify the location of the tmp-target. If not specified, a default is chosen. <b>Note:</b> tmp_dir must be on the final target cluster. </td></tr>
<tr class="b">
<td> <code>-async</code> </td>
<td> Run DistCp asynchronously. Quits as soon as the Hadoop Job is launched. </td>
<td> The Hadoop Job-id is logged, for tracking. </td></tr>
<tr class="a">
<td> <code>-diff &lt;oldSnapshot&gt; &lt;newSnapshot&gt;</code> </td>
<td> Use snapshot diff report between given two snapshots to identify the difference between source and target, and apply the diff to the target to make it in sync with source. </td>
<td> This option is valid only with <code>-update</code> option and the following conditions should be satisfied.
<ol style="list-style-type: decimal">
<li> Both the source and the target FileSystem must be DistributedFileSystem.</li>
<li> Two snapshots <code>&lt;oldSnapshot&gt;</code> and <code>&lt;newSnapshot&gt;</code> have been created on the source FS, and <code>&lt;oldSnapshot&gt;</code> is older than <code>&lt;newSnapshot&gt;</code>. </li>
<li> The target has the same snapshot <code>&lt;oldSnapshot&gt;</code>. No changes have been made on the target since <code>&lt;oldSnapshot&gt;</code> was created, thus <code>&lt;oldSnapshot&gt;</code> has the same content as the current state of the target. All the files/directories in the target are the same as in the source&#x2019;s <code>&lt;oldSnapshot&gt;</code>.</li></ol> </td></tr>
<tr class="b">
<td> <code>-rdiff &lt;newSnapshot&gt; &lt;oldSnapshot&gt;</code> </td>
<td> Use snapshot diff report between given two snapshots to identify what has been changed on the target since the snapshot <code>&lt;oldSnapshot&gt;</code> was created on the target, and apply the diff reversely to the target, and copy modified files from the source&#x2019;s <code>&lt;oldSnapshot&gt;</code>, to make the target the same as <code>&lt;oldSnapshot&gt;</code>. </td>
<td> This option is valid only with <code>-update</code> option and the following conditions should be satisfied.
<ol style="list-style-type: decimal">
<li>Both the source and the target FileSystem must be DistributedFileSystem. The source and the target can be two different clusters/paths, or they can be exactly the same cluster/path. (In the latter case, modified files are copied from the target&#x2019;s <code>&lt;oldSnapshot&gt;</code> to the target&#x2019;s current state.)</li>
<li> Two snapshots <code>&lt;newSnapshot&gt;</code> and <code>&lt;oldSnapshot&gt;</code> have been created on the target FS, and <code>&lt;oldSnapshot&gt;</code> is older than <code>&lt;newSnapshot&gt;</code>. No change has been made on target since <code>&lt;newSnapshot&gt;</code> was created on the target. </li>
<li> The source has the same snapshot <code>&lt;oldSnapshot&gt;</code>, which has the same content as the <code>&lt;oldSnapshot&gt;</code> on the target. All the files/directories in the target&#x2019;s <code>&lt;oldSnapshot&gt;</code> are the same as in the source&#x2019;s <code>&lt;oldSnapshot&gt;</code>.</li> </ol> </td></tr>
<tr class="a">
<td> <code>-numListstatusThreads</code> </td>
<td> Number of threads to use for building file listing </td>
<td> At most 40 threads. </td></tr>
<tr class="b">
<td> <code>-skipcrccheck</code> </td>
<td> Whether to skip CRC checks between source and target paths. </td>
<td> </td></tr>
<tr class="a">
<td> <code>-blocksperchunk &lt;blocksperchunk&gt;</code> </td>
<td> Number of blocks per chunk. When specified, split files into chunks to copy in parallel </td>
<td> If set to a positive value, files with more blocks than this value will be split into chunks of <code>&lt;blocksperchunk&gt;</code> blocks to be transferred in parallel, and reassembled on the destination. By default, <code>&lt;blocksperchunk&gt;</code> is 0 and the files will be transmitted in their entirety without splitting. This switch is only applicable when the source file system implements getBlockLocations method and the target file system implements concat method. </td></tr>
<tr class="b">
<td> <code>-copybuffersize &lt;copybuffersize&gt;</code> </td>
<td> Size of the copy buffer to use. By default, <code>&lt;copybuffersize&gt;</code> is set to 8192B </td>
<td> </td></tr>
<tr class="a">
<td> <code>-xtrack &lt;path&gt;</code> </td>
<td> Save information about missing source files to the specified path. </td>
<td> This option is only valid with <code>-update</code> option. This is an experimental property and it cannot be used with <code>-atomic</code> option. </td></tr>
<tr class="b">
<td> <code>-direct</code> </td>
<td> Write directly to destination paths </td>
<td> Useful for avoiding potentially very expensive temporary file rename operations when the destination is an object store </td></tr>
<tr class="a">
<td> <code>-useiterator</code> </td>
<td> Uses single threaded listStatusIterator to build listing </td>
<td> Useful for saving memory on the client side. When this option is used, the <code>-numListstatusThreads</code> option is ignored. </td></tr>
<tr class="b">
<td> <code>-updateRoot</code> </td>
<td> Update root directory attributes (e.g. permissions, ownership &#x2026;) </td>
<td> Useful if you need to force an update of the root directory attributes when using DistCp. </td></tr>
</tbody>
</table></section><section>
<h2><a name="Architecture_of_DistCp"></a>Architecture of DistCp</h2>
<p>The components of the new DistCp may be classified into the following categories:</p>
<ul>
<li>DistCp Driver</li>
<li>Copy-listing generator</li>
<li>Input-formats and Map-Reduce components</li>
</ul><section>
<h3><a name="DistCp_Driver"></a>DistCp Driver</h3>
<p>The DistCp Driver components are responsible for:</p>
<ul>
<li>
<p>Parsing the arguments passed to the DistCp command on the command-line, via:</p>
<ul>
<li>OptionsParser, and</li>
<li>DistCpOptionsSwitch</li>
</ul>
</li>
<li>
<p>Assembling the command arguments into an appropriate DistCpOptions object, and initializing DistCp. These arguments include:</p>
<ul>
<li>Source-paths</li>
<li>Target location</li>
<li>Copy options (e.g. whether to update-copy, overwrite, which file-attributes to preserve, etc.)</li>
</ul>
</li>
<li>
<p>Orchestrating the copy operation by:</p>
<ul>
<li>Invoking the copy-listing-generator to create the list of files to be copied.</li>
<li>Setting up and launching the Hadoop Map-Reduce Job to carry out the copy.</li>
<li>Based on the options, either returning a handle to the Hadoop MR Job immediately, or waiting till completion.</li>
</ul>
</li>
</ul>
<p>The parser-elements are exercised only from the command-line (or if DistCp::run() is invoked). The DistCp class may also be used programmatically, by constructing the DistCpOptions object, and initializing a DistCp object appropriately.</p></section><section>
<h3><a name="Copy-listing_Generator"></a>Copy-listing Generator</h3>
<p>The copy-listing-generator classes are responsible for creating the list of files/directories to be copied from source. They examine the contents of the source-paths (files/directories, including wild-cards), and record all paths that need to be copied into a SequenceFile, for consumption by the DistCp Hadoop Job. The main classes in this module include:</p>
<ol style="list-style-type: decimal">
<li><code>CopyListing</code>: The interface that should be implemented by any copy-listing-generator implementation. Also provides the factory method by which the concrete CopyListing implementation is chosen.</li>
<li><code>SimpleCopyListing</code>: An implementation of <code>CopyListing</code> that accepts multiple source paths (files/directories), and recursively lists all the individual files and directories under each, for copy.</li>
<li><code>GlobbedCopyListing</code>: Another implementation of <code>CopyListing</code> that expands wild-cards in the source paths.</li>
<li><code>FileBasedCopyListing</code>: An implementation of <code>CopyListing</code> that reads the source-path list from a specified file.</li>
</ol>
<p>Based on whether a source-file-list is specified in the DistCpOptions, the source-listing is generated in one of the following ways:</p>
<ol style="list-style-type: decimal">
<li>If there&#x2019;s no source-file-list, the <code>GlobbedCopyListing</code> is used. All wild-cards are expanded, and all the expansions are forwarded to the SimpleCopyListing, which in turn constructs the listing (via recursive descent of each path).</li>
<li>If a source-file-list is specified, the <code>FileBasedCopyListing</code> is used. Source-paths are read from the specified file, and then forwarded to the <code>GlobbedCopyListing</code>. The listing is then constructed as described above.</li>
</ol>
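<p>As a rough illustration of the two cases (the paths and the <code>srclist</code> file are placeholders), the first command below passes a wild-card source and so goes through <code>GlobbedCopyListing</code>, while the second supplies a source-file-list with <code>-f</code> and so starts from <code>FileBasedCopyListing</code>:</p>
<div class="source">
<div class="source">
<pre># Case 1: no source-file-list; the wild-card is expanded by GlobbedCopyListing
hadoop distcp 'hdfs://nn1:8020/logs/2016-*' hdfs://nn2:8020/archive/logs

# Case 2: source paths are read from the file given with -f
hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/archive/logs
</pre></div></div>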
<p>One may customize the method by which the copy-listing is constructed by providing a custom implementation of the CopyListing interface. The behaviour of DistCp differs here from the legacy DistCp, in how paths are considered for copy.</p>
<p>One may also customize which files are excluded from the copy, either by selecting one of the currently supported implementations of the CopyFilter interface or by writing a new one. The implementation is selected by setting <code>distcp.filters.class</code> in the DistCpOptions:</p>
<ol style="list-style-type: decimal">
<li><code>distcp.filters.class</code> to &#x201c;RegexCopyFilter&#x201d;. If you are using this implementation, you will also have to pass <code>distcp.filters.file</code>, the file that contains the regular expressions used for filtering. Supports regular expressions as specified by java.util.regex.Pattern.</li>
<li><code>distcp.filters.class</code> to &#x201c;RegexpInConfigurationFilter&#x201d;. If you are using this implementation, you will also have to pass the regex using the <code>distcp.exclude-file-regex</code> parameter in &#x201c;DistCpOptions&#x201d;. Supports regular expressions as specified by java.util.regex.Pattern. This is a more dynamic approach than &#x201c;RegexCopyFilter&#x201d;.</li>
<li><code>distcp.filters.class</code> to &#x201c;TrueCopyFilter&#x201d;. This is used as a default implementation if none of the above options are specified.</li>
</ol>
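<p>For example, a filters file can be passed on the command line with the <code>-filters</code> option. This is only a sketch: the file name and the patterns are made up for illustration, and each line is a <code>java.util.regex.Pattern</code> expression matched against the paths considered for copy.</p>
<div class="source">
<div class="source">
<pre># Write one exclusion pattern per line (hypothetical file name)
printf '%s\n' '.*\.tmp$' '.*/_temporary/.*' &gt; /tmp/distcp-excludes.txt

# Paths matching any of the patterns are excluded from the copy
hadoop distcp -filters /tmp/distcp-excludes.txt \
    hdfs://nn1:8020/source hdfs://nn2:8020/target
</pre></div></div>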
<p>The legacy implementation only lists those paths that must definitely be copied on to target. E.g. if a file already exists at the target (and <code>-overwrite</code> isn&#x2019;t specified), the file isn&#x2019;t even considered in the MapReduce Copy Job. Determining this during setup (i.e. before the MapReduce Job) involves file-size and checksum-comparisons that are potentially time-consuming.</p>
<p>The new DistCp postpones such checks until the MapReduce Job, thus reducing setup time. Performance is enhanced further since these checks are parallelized across multiple maps.</p></section><section>
<h3><a name="InputFormats_and_MapReduce_Components"></a>InputFormats and MapReduce Components</h3>
<p>The InputFormats and MapReduce components are responsible for the actual copy of files and directories from the source to the destination path. The listing-file created during copy-listing generation is consumed at this point, when the copy is carried out. The classes of interest here include:</p>
<ul>
<li>
<p><b>UniformSizeInputFormat:</b> This implementation of org.apache.hadoop.mapreduce.InputFormat provides equivalence with Legacy DistCp in balancing load across maps. The aim of the UniformSizeInputFormat is to make each map copy roughly the same number of bytes. Accordingly, the listing file is split into groups of paths, such that the sum of file-sizes in each InputSplit is nearly equal to that of every other split. The splitting isn&#x2019;t always perfect, but its trivial implementation keeps the setup-time low.</p>
</li>
<li>
<p><b>DynamicInputFormat and DynamicRecordReader:</b> The DynamicInputFormat implements <code>org.apache.hadoop.mapreduce.InputFormat</code>, and is new to DistCp. The listing-file is split into several &#x201c;chunk-files&#x201d;, the exact number of chunk-files being a multiple of the number of maps requested for the Hadoop Job. Each map task is &#x201c;assigned&#x201d; one of the chunk-files (by renaming the chunk to the task&#x2019;s id), before the Job is launched. Paths are read from each chunk using the <code>DynamicRecordReader</code>, and processed in the CopyMapper. After all the paths in a chunk are processed, the current chunk is deleted and a new chunk is acquired. The process continues until no more chunks are available. This &#x201c;dynamic&#x201d; approach allows faster map-tasks to consume more paths than slower ones, thus speeding up the DistCp job overall.</p>
</li>
<li>
<p><b>CopyMapper:</b> This class implements the physical file-copy. The input-paths are checked against the input-options (specified in the Job&#x2019;s Configuration), to determine whether a file needs to be copied. A file will be copied only if at least one of the following is true:</p>
<ul>
<li>A file with the same name doesn&#x2019;t exist at target.</li>
<li>A file with the same name exists at target, but has a different file size.</li>
<li>A file with the same name exists at target, but has a different checksum, and <code>-skipcrccheck</code> isn&#x2019;t specified.</li>
<li>A file with the same name exists at target, but <code>-overwrite</code> is specified.</li>
<li>A file with the same name exists at target, but differs in block-size and block-size needs to be preserved.</li>
</ul>
</li>
<li>
<p><b>CopyCommitter:</b> This class is responsible for the commit-phase of the DistCp job, including:</p>
<ul>
<li>Preservation of directory-permissions (if specified in the options)</li>
<li>Clean-up of temporary-files, work-directories, etc.</li>
</ul>
</li>
</ul></section></section><section>
<h2><a name="Appendix"></a>Appendix</h2><section>
<h3><a name="Map_sizing"></a>Map sizing</h3>
<p>By default, DistCp makes an attempt to size each map comparably so that each copies roughly the same number of bytes. Note that files are the finest level of granularity, so increasing the number of simultaneous copiers (i.e. maps) may not always increase the number of simultaneous copies nor the overall throughput.</p>
<p>The new DistCp also provides a strategy to &#x201c;dynamically&#x201d; size maps, allowing faster data-nodes to copy more bytes than slower nodes. With <code>-strategy dynamic</code> (explained in the Architecture), rather than assigning a fixed set of source-files to each map-task, files are instead split into several sets, or chunks. The number of chunks exceeds the number of maps, usually by a factor of 2-3. Each map picks up and copies all files listed in a chunk. When a chunk is exhausted, a new chunk is acquired and processed, until no more chunks remain.</p>
<p>By not assigning a source-path to a fixed map, faster map-tasks (i.e. data-nodes) are able to consume more chunks, and thus copy more data, than slower nodes. While this distribution isn&#x2019;t uniform, it is fair with regard to each mapper&#x2019;s capacity.</p>
<p>The dynamic-strategy is implemented by the <code>DynamicInputFormat</code>. It provides superior performance under most conditions.</p>
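<p>A minimal sketch of selecting this strategy from the command line; the paths and the map count of 40 are arbitrary placeholders:</p>
<div class="source">
<div class="source">
<pre># Use DynamicInputFormat instead of the default uniformsize strategy
hadoop distcp -strategy dynamic -m 40 \
    hdfs://nn1:8020/source hdfs://nn2:8020/target
</pre></div></div>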
<p>Tuning the number of maps to the size of the source and destination clusters, the size of the copy, and the available bandwidth is recommended for long-running and regularly run jobs.</p></section><section>
<h3><a name="Copying_Between_Versions_of_HDFS"></a>Copying Between Versions of HDFS</h3>
<p>For copying between two different major versions of Hadoop (e.g. between 1.X and 2.X), one will usually use WebHdfsFileSystem. Unlike the older HftpFileSystem, webhdfs is available for both read and write operations, so DistCp can be run on either the source or the destination cluster. The remote cluster is specified as <code>webhdfs://&lt;namenode_hostname&gt;:&lt;http_port&gt;</code>. When copying between clusters of the same major Hadoop version (e.g. between 2.X and 2.X), use the hdfs protocol for better performance.</p></section><section>
<h3><a name="Secure_Copy_over_the_wire_with_distcp"></a>Secure Copy over the wire with distcp</h3>
<p>Use the &#x201c;<code>swebhdfs://</code>&#x201d; scheme when webhdfs is secured with SSL. For more information see <a href="../hadoop-project-dist/hadoop-hdfs/WebHDFS.html#SSL_Configurations_for_SWebHDFS">SSL Configurations for SWebHDFS</a>.</p></section><section>
<h3><a name="MapReduce_and_other_side-effects"></a>MapReduce and other side-effects</h3>
<p>As mentioned earlier, should a map fail to copy one of its inputs, there will be several side-effects.</p>
<ul>
<li>Unless <code>-overwrite</code> is specified, files successfully copied by a previous map will be marked as &#x201c;skipped&#x201d; on a re-execution.</li>
<li>If a map fails <code>mapreduce.map.maxattempts</code> times, the remaining map tasks will be killed (unless <code>-i</code> is set).</li>
<li>If <code>mapreduce.map.speculative</code> is set to be true, the result of the copy is undefined.</li>
</ul></section><section>
<h3><a name="DistCp_and_Object_Stores"></a>DistCp and Object Stores</h3>
<p>DistCp works with Object Stores such as Amazon S3, Azure ABFS and Google GCS.</p>
<p>Prerequisites</p>
<ol style="list-style-type: decimal">
<li>The JAR containing the object store implementation is on the classpath, along with all of its dependencies.</li>
<li>Unless the JAR automatically registers its bundled filesystem clients, the configuration may need to be modified to state the class which implements the filesystem scheme. All of the ASF&#x2019;s own object store clients are self-registering.</li>
<li>The relevant object store access credentials must be available in the cluster configuration, or be otherwise available in all cluster hosts.</li>
</ol>
<p>DistCp can be used to upload data:</p>
<div class="source">
<div class="source">
<pre>hadoop distcp -direct hdfs://nn1:8020/datasets/set1 s3a://bucket/datasets/set1
</pre></div></div>
<p>To download data:</p>
<div class="source">
<div class="source">
<pre>hadoop distcp s3a://bucket/generated/results hdfs://nn1:8020/results
</pre></div></div>
<p>To copy data between object stores:</p>
<div class="source">
<div class="source">
<pre>hadoop distcp s3a://bucket/generated/results \
wasb://updates@example.blob.core.windows.net
</pre></div></div>
<p>And to copy data within an object store:</p>
<div class="source">
<div class="source">
<pre>hadoop distcp wasb://updates@example.blob.core.windows.net/current \
wasb://updates@example.blob.core.windows.net/old
</pre></div></div>
<p>And to use <code>-update</code> to only copy changed files:</p>
<div class="source">
<div class="source">
<pre>hadoop distcp -update -numListstatusThreads 20 \
s3a://history/2016 \
hdfs://nn1:8020/history/2016
</pre></div></div>
<p>Because object stores are slow to list files, consider setting the <code>-numListstatusThreads</code> option when performing a <code>-update</code> operation on a large directory tree (the limit is 40 threads).</p>
<p>When <code>DistCp -update</code> is used with object stores, generally only the modification time and length of the individual files are compared, not any checksums if the checksum algorithm between the two stores is different.</p>
<ul>
<li>
<p>The <code>distcp -update</code> between two object stores with different checksum algorithms compares the modification times of source and target files along with the file size to determine whether to skip the file copy. The behavior is controlled by the property <code>distcp.update.modification.time</code>, which is set to true by default. If the source file is more recently modified than the target file, it is assumed that the content has changed, and the file should be updated. We need to ensure that there is no clock skew between the machines. The fact that most object stores do have valid timestamps for directories is irrelevant; only the file timestamps are compared. However, it is important to have the clock of the client computers close to that of the infrastructure, so that timestamps are consistent between the client/HDFS cluster and that of the object store. Otherwise, changed files may be missed or copied too often.</p>
</li>
<li>
<p><code>distcp.update.modification.time</code> is only used if either of the two stores doesn&#x2019;t have checksum validation, so that the checksums of the two cannot be compared. Even if the property is set to true, it won&#x2019;t be used if there is a valid checksum comparison between the two stores.</p>
</li>
</ul>
<p>To turn off the modification time check, set this in your core-site.xml:</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;distcp.update.modification.time&lt;/name&gt;
&lt;value&gt;false&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
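<p>As an alternative sketch, and assuming the property is picked up through Hadoop&#x2019;s generic <code>-D</code> option handling like other job configuration, the check can be turned off for a single run instead of cluster-wide. The bucket and paths below are placeholders:</p>
<div class="source">
<div class="source">
<pre># Override the property for this job only; -D must precede the DistCp options
hadoop distcp -Ddistcp.update.modification.time=false -update \
    s3a://bucket/history/2016 hdfs://nn1:8020/history/2016
</pre></div></div>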
<p><b>Notes</b></p>
<ul>
<li>
<p>The <code>-atomic</code> option causes a rename of the temporary data, so it significantly increases the time to commit work at the end of the operation. Furthermore, as Object Stores other than (optionally) <code>wasb://</code> do not offer atomic renames of directories, the <code>-atomic</code> operation doesn&#x2019;t actually deliver what is promised. <i>Avoid</i>.</p>
</li>
<li>
<p>The <code>-append</code> option is not supported.</p>
</li>
<li>
<p>The <code>-diff</code> and <code>-rdiff</code> options are not supported.</p>
</li>
<li>
<p>CRC checking will not be performed, irrespective of the value of the <code>-skipcrccheck</code> flag.</p>
</li>
<li>
<p>All <code>-p</code> options, including those to preserve permissions, user and group information, attributes, checksums and replication, are generally ignored. The <code>wasb://</code> connector will preserve the information, but not enforce the permissions.</p>
</li>
<li>
<p>Some object store connectors offer an option for in-memory buffering of output (the S3A connector, for example). Using such an option while copying large files may trigger some form of out-of-memory event, be it a heap overflow or a YARN container termination. This is particularly common if the network bandwidth between the cluster and the object store is limited (such as when working with remote object stores). It is best to disable/avoid such options and rely on disk buffering.</p>
</li>
<li>
<p>Copy operations within a single object store still take place in the Hadoop cluster, even when the object store implements a more efficient COPY operation internally.</p>
<p>That is, an operation such as</p>
<p><code>hadoop distcp s3a://bucket/datasets/set1 s3a://bucket/datasets/set2</code></p>
<p>copies each byte down to the Hadoop worker nodes and back to the bucket. As well as being slow, it means that charges may be incurred.</p>
</li>
<li>
<p>The <code>-direct</code> option can be used to write to object store target paths directly, avoiding the potentially very expensive temporary file rename operations that would otherwise occur.</p>
</li>
</ul></section></section><section>
<h2><a name="Frequently_Asked_Questions"></a>Frequently Asked Questions</h2>
<ol style="list-style-type: decimal">
<li>
<p><b>Why does -update not create the parent source-directory under a pre-existing target directory?</b> The behaviour of <code>-update</code> and <code>-overwrite</code> is described in detail in the Usage section of this document. In short, if either option is used with a pre-existing destination directory, the <b>contents</b> of each source directory are copied over, rather than the source-directory itself. This behaviour is consistent with the legacy DistCp implementation as well.</p>
</li>
<li>
<p><b>How does the new DistCp differ in semantics from the Legacy DistCp?</b></p>
<ul>
<li>With Legacy DistCp, files that were skipped during copy also had their file-attributes (permissions, owner/group info, etc.) left unchanged. These are now updated, even if the file-copy is skipped.</li>
<li>Empty root directories among the source-path inputs were not created at the target, in Legacy DistCp. These are now created.</li>
</ul>
</li>
<li>
<p><b>Why does the new DistCp use more maps than legacy DistCp?</b> Legacy DistCp works by figuring out what files need to be actually copied to target before the copy-job is launched, and then launching as many maps as required for copy. So if a majority of the files need to be skipped (because they already exist, for example), fewer maps will be needed. As a consequence, the time spent in setup (i.e. before the M/R job) is higher. The new DistCp calculates only the contents of the source-paths. It doesn&#x2019;t try to filter out what files can be skipped. That decision is put off till the M/R job runs. This is much faster (vis-a-vis execution-time), but the number of maps launched will be as specified in the <code>-m</code> option, or 20 (default) if unspecified.</p>
</li>
<li>
<p><b>Why does DistCp not run faster when more maps are specified?</b> At present, the smallest unit of work for DistCp is a file; i.e., a file is processed by only one map. Increasing the number of maps to a value exceeding the number of files would yield no performance benefit. The number of maps launched would equal the number of files.</p>
</li>
<li>
<p><b>Why does DistCp run out of memory?</b> If the number of individual files/directories being copied from the source path(s) is extremely large (e.g. 1,000,000 paths), DistCp might run out of memory while determining the list of paths for copy. This is not unique to the new DistCp implementation. To get around this, consider changing the <code>-Xmx</code> JVM heap-size parameters, as follows:</p>
<div class="source">
<div class="source">
<pre> bash$ export HADOOP_CLIENT_OPTS=&quot;-Xms64m -Xmx1024m&quot;
bash$ hadoop distcp /source /target
</pre></div></div>
</li>
</ol></section>
</div>
</div>
<div class="clear">
<hr/>
</div>
<div id="footer">
<div class="xright">
&#169; 2008-2023
Apache Software Foundation
- <a href="http://maven.apache.org/privacy-policy.html">Privacy Policy</a>.
Apache Maven, Maven, Apache, the Apache feather logo, and the Apache Maven project logos are trademarks of The Apache Software Foundation.
</div>
<div class="clear">
<hr/>
</div>
</div>
</body>
</html>