HADOOP-16037. DistCp: Document usage of Sync (-diff option) in detail.

Contributed by Siyao Meng

(cherry picked from commit ce4bafdf44)
Siyao Meng 2019-03-26 18:43:43 +00:00 committed by Steve Loughran
parent 162e9999c7
commit 52cfbc39cc
1 changed files with 120 additions and 0 deletions


@ -13,6 +13,7 @@
-->
#set ( $H3 = '###' )
#set ( $H4 = '####' )
DistCp Guide
=====================
@ -23,6 +24,7 @@ DistCp Guide
- [Usage](#Usage)
- [Basic Usage](#Basic_Usage)
- [Update and Overwrite](#Update_and_Overwrite)
- [Sync](#Sync)
- [Command Line Options](#Command_Line_Options)
- [Architecture of DistCp](#Architecture_of_DistCp)
- [DistCp Driver](#DistCp_Driver)
@ -192,6 +194,124 @@ $H3 Update and Overwrite
If `-overwrite` is used, `1` is overwritten as well.
$H3 Sync
The `-diff` option syncs files from a source cluster to a target cluster using
a snapshot diff. It copies, renames and removes the files listed in the snapshot diff.
The `-update` option must be included whenever the `-diff` option is used.
Most cloud providers don't work well with sync at the moment.
Usage:

    hadoop distcp -update -diff <from_snapshot> <to_snapshot> <source> <destination>

Example:

    hadoop distcp -update -diff snap1 snap2 /src/ /dst/

The command above applies changes from snapshot `snap1` to `snap2`
(i.e. snapshot diff from `snap1` to `snap2`) in `/src/` to `/dst/`.
This requires `/src/` to have both snapshots `snap1` and `snap2`.
The destination `/dst/` must also have a snapshot with the same
name as `<from_snapshot>`, in this case `snap1`. In addition, the destination
`/dst/` must not have had any file operations (create, rename, delete) since `snap1`.
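
The snapshot prerequisites above can be checked before running the sync. The following is a minimal sketch, not part of DistCp: `check_sync_prereqs` is a hypothetical helper name, and it assumes each snapshot is visible under the `.snapshot` directory of its snapshottable path so that `hdfs dfs -test -d` can probe it. It does not check whether `/dst/` has changed since `<from_snapshot>`.

```shell
# Hypothetical helper (not part of DistCp): verify that the snapshots
# needed by "distcp -update -diff" exist on both sides.
check_sync_prereqs() {
  # Strip a trailing slash so that "/src/" becomes "/src".
  local src="${1%/}" dst="${2%/}" from="$3" to="$4"
  local path
  # The source needs both snapshots; the destination needs <from_snapshot>.
  for path in "$src/.snapshot/$from" "$src/.snapshot/$to" "$dst/.snapshot/$from"; do
    if ! hdfs dfs -test -d "$path"; then
      echo "missing snapshot: $path" >&2
      return 1
    fi
  done
}
```

It could be combined with the sync itself, for example `check_sync_prereqs /src/ /dst/ snap1 snap2 && hadoop distcp -update -diff snap1 snap2 /src/ /dst/`.
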
Note that when this command finishes, a new snapshot `snap2` will NOT be
created at `/dst/`.
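
Because the sync does not create `snap2` at `/dst/`, a later incremental sync (for example from `snap2` to a newer snapshot) would find no matching `<from_snapshot>` on the destination. One possible follow-up, sketched here under the assumption that further incremental syncs are planned, is to create that snapshot right after a successful run. `sync_and_snapshot` is a hypothetical helper name, not a DistCp feature.

```shell
# Hypothetical helper: run the sync and, only if it succeeds, create a
# snapshot named <to_snapshot> on the destination for the next sync.
sync_and_snapshot() {
  local src="$1" dst="$2" from="$3" to="$4"
  # Apply the snapshot diff from the source to the destination.
  hadoop distcp -update -diff "$from" "$to" "$src" "$dst" || return 1
  # DistCp does not do this itself; create the snapshot explicitly.
  hdfs dfs -createSnapshot "$dst" "$to"
}
```

For the example above this would be called as `sync_and_snapshot /src/ /dst/ snap1 snap2`.
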
As noted above, `-update` is required whenever the `-diff` option is used.
For instance, in `/src/`, if `1.txt` is added and `2.txt` is deleted after
the creation of `snap1` and before the creation of `snap2`, the command above
will copy `1.txt` from `/src/` to `/dst/` and delete `2.txt` from `/dst/`.
Sync behavior will be elaborated using experiments below.
$H4 Experiment 1: Syncing diff of two adjacent snapshots
Some preparations before we start.

    # Create source and destination directories
    hdfs dfs -mkdir /src/ /dst/
    # Allow snapshot on source
    hdfs dfsadmin -allowSnapshot /src/
    # Create a snapshot (empty one)
    hdfs dfs -createSnapshot /src/ snap1
    # Allow snapshot on destination
    hdfs dfsadmin -allowSnapshot /dst/
    # Create a from_snapshot with the same name
    hdfs dfs -createSnapshot /dst/ snap1
    # Put one text file under /src/
    echo "This is the 1st text file." > 1.txt
    hdfs dfs -put 1.txt /src/
    # Create the second snapshot
    hdfs dfs -createSnapshot /src/ snap2
    # Put another text file under /src/
    echo "This is the 2nd text file." > 2.txt
    hdfs dfs -put 2.txt /src/
    # Create the third snapshot
    hdfs dfs -createSnapshot /src/ snap3

Then we run distcp sync:

    hadoop distcp -update -diff snap1 snap2 /src/ /dst/

The command above should succeed. `1.txt` will be copied from `/src/` to
`/dst/`. Again, the `-update` option is required.
If we run the same command again, we will get a `DistCp sync failed` exception
because the destination has gained a new file, `1.txt`, since `snap1`. However,
if we manually remove `1.txt` from `/dst/` and run the sync again, the
command will succeed.
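
To see why a repeated run fails, the changes made on the destination since `snap1` can be listed with `hdfs snapshotDiff`, where `.` stands for the current state of the directory. A small sketch; `dst_changes_since` is a hypothetical wrapper name, not an HDFS command:

```shell
# Hypothetical wrapper: list what changed under a snapshottable directory
# between the given snapshot and its current state (".").
dst_changes_since() {
  local dst="$1" from="$2"
  hdfs snapshotDiff "$dst" "$from" .
}
```

After the first sync in this experiment, `dst_changes_since /dst/ snap1` would be expected to report the newly copied `1.txt`, which is the change that blocks the second run.
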
$H4 Experiment 2: Syncing diff of two non-adjacent snapshots
First, clean up from Experiment 1.

    hdfs dfs -rm -skipTrash /dst/1.txt

Run the sync command. Note that `<to_snapshot>` has been changed from `snap2` in
Experiment 1 to `snap3`.

    hadoop distcp -update -diff snap1 snap3 /src/ /dst/

Both `1.txt` and `2.txt` will be copied to `/dst/`.
$H4 Experiment 3: Syncing file delete operation
Continuing from the end of Experiment 2:

    hdfs dfs -rm -skipTrash /dst/2.txt
    # Create snap2 at destination, it contains 1.txt
    hdfs dfs -createSnapshot /dst/ snap2
    # Delete 1.txt from source
    hdfs dfs -rm -skipTrash /src/1.txt
    # Create snap4 at source, it only contains 2.txt
    hdfs dfs -createSnapshot /src/ snap4

Run the sync command now:

    hadoop distcp -update -diff snap2 snap4 /src/ /dst/

`2.txt` is copied and `1.txt` is deleted under `/dst/`.
Note that, though both `/src/` and `/dst/` have a snapshot with the same name
`snap2`, the snapshots don't need to have the same content.
That means, if `1.txt` in `/dst/`'s `snap2` has different contents from
`1.txt` in `/src/`'s `snap2`, `1.txt` will still be removed from `/dst/`.
The sync command doesn't check the contents of the files that are going to
be deleted. It simply follows the snapshot diff list between `<from_snapshot>`
and `<to_snapshot>`.
Also, if we delete `1.txt` from `/dst/` before creating `snap2` on `/dst/`
in the steps above, so that `/dst/`'s `snap2` doesn't have `1.txt` before
running the sync command, the command will still succeed. It won't throw an
exception while trying to delete `1.txt` from `/dst/`, where it doesn't exist.
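
Since the sync follows the snapshot diff list, it can help to look at that list on the source before running the command. A minimal sketch, with `preview_sync` as a hypothetical helper name; it prints the diff report and the matching sync command without executing the sync:

```shell
# Hypothetical helper: show the snapshot diff that a sync would follow,
# then print (but do not run) the corresponding distcp command.
preview_sync() {
  local src="$1" dst="$2" from="$3" to="$4"
  # Report the differences between the two source snapshots.
  hdfs snapshotDiff "$src" "$from" "$to" || return 1
  # Echo the sync command so it can be reviewed before being run by hand.
  echo "hadoop distcp -update -diff $from $to $src $dst"
}
```

For this experiment, `preview_sync /src/ /dst/ snap2 snap4` would be expected to list the creation of `2.txt` and the deletion of `1.txt`.
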
$H3 raw Namespace Extended Attribute Preservation
This section only applies to HDFS.