+ The most common invocation of DistCp is an inter-cluster copy:
+ bash$ hadoop jar hadoop-distcp.jar hdfs://nn1:8020/foo/bar \
+ hdfs://nn2:8020/bar/foo
+
+ This will expand the namespace under /foo/bar on nn1 into a temporary
+ file, partition its contents among a set of map tasks, and start a copy
+ on each TaskTracker from nn1 to nn2.
+
+ One can also specify multiple source directories on the command
+ line:
+ bash$ hadoop jar hadoop-distcp.jar hdfs://nn1:8020/foo/a \
+ hdfs://nn1:8020/foo/b \
+ hdfs://nn2:8020/bar/foo
+
+ Or, equivalently, from a file using the -f option:
+ bash$ hadoop jar hadoop-distcp.jar -f hdfs://nn1:8020/srclist \
+ hdfs://nn2:8020/bar/foo
+
+ where srclist contains:
+ hdfs://nn1:8020/foo/a
+ hdfs://nn1:8020/foo/b
+
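One way to produce such a source list is sketched below. The helper name write_srclist and the local path /tmp/srclist are hypothetical, and the commented hadoop commands assume a Hadoop client on the PATH; they are not executed by the snippet.

```shell
# Write one fully qualified source URI per line to a local file.
# write_srclist is a hypothetical helper, not part of DistCp.
write_srclist() {
    out=$1
    shift
    printf '%s\n' "$@" > "$out"
}

# Possible usage (assumes a Hadoop client; not run here):
#   write_srclist /tmp/srclist hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b
#   hadoop fs -put /tmp/srclist hdfs://nn1:8020/srclist
#   hadoop jar hadoop-distcp.jar -f hdfs://nn1:8020/srclist \
#       hdfs://nn2:8020/bar/foo
```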
+ When copying from multiple sources, DistCp will abort the copy with
+ an error message if two sources collide, whereas collisions at the
+ destination are resolved per the options specified. By default, files
+ already existing at the destination are skipped (i.e., not replaced by
+ the source file). A count of skipped files is reported at the end of
+ each job, but it may be inaccurate if a copier failed for some subset
+ of its files and succeeded on a later attempt.
+
+ It is important that each TaskTracker can reach and communicate with
+ both the source and destination file systems. For HDFS, both the source
+ and destination must be running the same version of the protocol or use
+ a backwards-compatible protocol (see Copying Between
+ Versions).
+
+ After a copy, one should generate and cross-check a listing of the
+ source and destination to verify that the copy was truly successful.
+ Since DistCp employs both Map/Reduce and the FileSystem API, issues in
+ or between any of the three could adversely and silently affect the
+ copy. Some have had success running with -update enabled to perform a
+ second pass, but users should be acquainted with its semantics before
+ attempting this.
+
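The cross-check described above could be sketched roughly as follows. Both helper names are hypothetical. normalize_listing assumes the 8-column output of `hadoop fs -ls -R` (size in column 5, path in column 8) and paths without whitespace; it compares names and sizes only, so it will not catch files that differ in content but not in length.

```shell
# Reduce `hadoop fs -ls -R`-style output (stdin) to "relative/path size"
# lines for files only. Assumption: 8 whitespace-separated columns with
# the size in column 5 and the path in column 8.
normalize_listing() {
    awk -v root="$1" '
        $1 !~ /^d/ && NF >= 8 {
            path = $8
            # Strip the root prefix so both sides use relative paths.
            if (index(path, root) == 1) path = substr(path, length(root) + 1)
            sub(/^\//, "", path)
            print path, $5
        }'
}

# Compare two normalized listings. Prints the differing lines and
# returns 0 when the listings match, 1 when they differ.
compare_listings() {
    src_sorted=$(mktemp) && dst_sorted=$(mktemp) || return 2
    sort "$1" > "$src_sorted"
    sort "$2" > "$dst_sorted"
    diff "$src_sorted" "$dst_sorted"
    status=$?
    rm -f "$src_sorted" "$dst_sorted"
    return $status
}

# Possible usage (assumes a Hadoop client; not run here). Pass whatever
# root prefix the listing actually prints:
#   hadoop fs -ls -R hdfs://nn1:8020/foo/bar | normalize_listing /foo/bar > src.txt
#   hadoop fs -ls -R hdfs://nn2:8020/bar/foo | normalize_listing /bar/foo > dst.txt
#   compare_listings src.txt dst.txt && echo "listings match"
```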
+ It's also worth noting that if another client is still writing to a
+ source file, the copy will likely fail. Attempting to overwrite a file
+ being written at the destination should also fail on HDFS. If a source
+ file is (re)moved before it is copied, the copy will fail with a
+ FileNotFoundException.
+
+ Please refer to the detailed Command Line Reference for information
+ on all the options available in DistCp.
+
+
+
+
+ -update is used to copy files from source that don't exist at the
+ target, or have different contents. -overwrite overwrites files that
+ already exist at the target, even if they have the same contents.
+
+
+ The update and overwrite options warrant special attention, since their
+ handling of source paths varies from the defaults in a very subtle
+ manner. Consider a copy from /source/first/ and /source/second/ to
+ /target/, where the source paths have the following contents:
+
+ hdfs://nn1:8020/source/first/1
+ hdfs://nn1:8020/source/first/2
+ hdfs://nn1:8020/source/second/10
+ hdfs://nn1:8020/source/second/20
+
+
+ When DistCp is invoked without -update or -overwrite, the DistCp
+ defaults would create directories first/ and second/ under /target.
+ Thus:
+
+ distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
+
+ would yield the following contents in /target:
+
+ hdfs://nn2:8020/target/first/1
+ hdfs://nn2:8020/target/first/2
+ hdfs://nn2:8020/target/second/10
+ hdfs://nn2:8020/target/second/20
+
+
+ When either -update or -overwrite is specified, the contents of the
+ source directories are copied to the target, and not the source
+ directories themselves. Thus:
+
+ distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
+
+
+ would yield the following contents in /target:
+
+ hdfs://nn2:8020/target/1
+ hdfs://nn2:8020/target/2
+ hdfs://nn2:8020/target/10
+ hdfs://nn2:8020/target/20
+
+
+ By extension, if both source folders contained a file with the same
+ name (say, 0), then both sources would map an entry to /target/0 at
+ the destination. Rather than permit this conflict, DistCp will abort.
+
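The path mapping and the resulting collision can be illustrated with a small sketch. The function names map_target and find_collisions are hypothetical, and the code only mirrors the mapping rules described above; it is not how DistCp itself is implemented.

```shell
# Map a file inside a source directory to its destination path.
#   mode "default":            the source directory name is kept under the target
#   mode "update"/"overwrite": only the directory contents are copied
map_target() {
    mode=$1
    src_dir=${2%/}
    rel=$3
    target=${4%/}
    case $mode in
        default)
            printf '%s/%s/%s\n' "$target" "$(basename "$src_dir")" "$rel" ;;
        update|overwrite)
            printf '%s/%s\n' "$target" "$rel" ;;
        *)
            return 2 ;;
    esac
}

# Read destination paths on stdin and print those claimed more than once.
find_collisions() {
    sort | uniq -d
}
```

For example, mapping a file named 0 from both /source/first and /source/second in update mode and piping the two results through find_collisions prints /target/0, the conflict on which DistCp would abort.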
+
+ Now, consider the following copy operation:
+
+ distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
+
+
+ With sources/sizes:
+
+ hdfs://nn1:8020/source/first/1 32
+ hdfs://nn1:8020/source/first/2 32
+ hdfs://nn1:8020/source/second/10 64
+ hdfs://nn1:8020/source/second/20 32
+
+
+ And destination/sizes:
+
+ hdfs://nn2:8020/target/1 32
+ hdfs://nn2:8020/target/10 32
+ hdfs://nn2:8020/target/20 64
+
+
+ The copy will result in the following at the destination:
+
+ hdfs://nn2:8020/target/1 32
+ hdfs://nn2:8020/target/2 32
+ hdfs://nn2:8020/target/10 64
+ hdfs://nn2:8020/target/20 32
+
+
+ With -update, 1 is skipped because the file length and contents match.
+ 2 is copied because it doesn't exist at the target. 10 and 20 are
+ overwritten since their contents don't match the source.
+
+ If -overwrite is used, 1 is overwritten as well.
+
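The per-file decision in this example can be sketched as follows. plan_copy is a hypothetical helper that works on "name size" listings and uses the file size alone as a stand-in for the length-and-contents comparison, so it is an approximation of the behaviour described here rather than DistCp's actual check.

```shell
# Print the action taken for each source file.
#   $1: mode ("update" or "overwrite")
#   $2: source listing, one "name size" pair per line
#   $3: destination listing in the same format
plan_copy() {
    awk -v mode="$1" -v dstfile="$3" '
        BEGIN {
            # Load the destination listing into a name -> size table.
            while ((getline line < dstfile) > 0) {
                split(line, f, " ")
                dst[f[1]] = f[2]
            }
        }
        !($1 in dst)        { print "copy", $1; next }       # missing at target
        mode == "overwrite" { print "overwrite", $1; next }  # always replaced
        dst[$1] == $2       { print "skip", $1; next }       # sizes match
                            { print "overwrite", $1 }        # sizes differ
    ' "$2"
}
```

Run in update mode against the listings above, the sketch reports that 1 is skipped, 2 is copied, and 10 and 20 are overwritten; in overwrite mode 1 is overwritten as well.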
+
+
+
+