HADOOP-13655. document object store use with fs shell and distcp. Contributed by Steve Loughran
This closes #131
parent 83cc7263af
commit beb70fed4f

@ -53,10 +53,14 @@ Returns 0 on success and 1 on error.

cat
---

Usage: `hadoop fs -cat URI [URI ...]`
Usage: `hadoop fs -cat [-ignoreCrc] URI [URI ...]`

Copies source paths to stdout.

Options

* The `-ignoreCrc` option disables checksum verification.

Example:

* `hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2`

@ -116,11 +120,16 @@ copyFromLocal

Usage: `hadoop fs -copyFromLocal <localsrc> URI`

Similar to put command, except that the source is restricted to a local file reference.
Similar to the `fs -put` command, except that the source is restricted to a local file reference.

Options:

* The -f option will overwrite the destination if it already exists.
* `-p` : Preserves access and modification times, ownership and the permissions.
  (assuming the permissions can be propagated across filesystems)
* `-f` : Overwrites the destination if it already exists.
* `-l` : Allow DataNode to lazily persist the file to disk. Forces a replication
  factor of 1. This flag will result in reduced durability. Use with care.
* `-d` : Skip creation of temporary file with the suffix `._COPYING_`.
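
As an illustration only, a minimal invocation combining the flags above; the file and directory names are placeholders rather than examples from the original text:

```bash
# overwrite any existing copy and skip the ._COPYING_ temporary file
hadoop fs -copyFromLocal -f -d /tmp/sample.txt /user/hadoop/sample.txt
```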

copyToLocal
-----------

@ -300,7 +309,7 @@ Returns 0 on success and -1 on error.

get
---

Usage: `hadoop fs -get [-ignorecrc] [-crc] <src> <localdst> `
Usage: `hadoop fs -get [-ignorecrc] [-crc] [-p] [-f] <src> <localdst> `

Copy files to the local file system. Files that fail the CRC check may be copied with the `-ignorecrc` option. Files and CRCs may be copied using the `-crc` option.

@ -315,7 +324,11 @@ Returns 0 on success and -1 on error.

Options:

The -f option will overwrite the destination if it already exists.
* `-p` : Preserves access and modification times, ownership and the permissions.
  (assuming the permissions can be propagated across filesystems)
* `-f` : Overwrites the destination if it already exists.
* `-ignorecrc` : Skip CRC checks on the file(s) downloaded.
* `-crc` : Write CRC checksums for the files downloaded.
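
For illustration, a download combining the new flags; the paths are placeholders, not part of the original examples:

```bash
# overwrite any local copy, preserving times, ownership and permissions where possible
hadoop fs -get -f -p /user/hadoop/results.csv /tmp/results.csv
```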

getfacl
-------

@ -483,13 +496,28 @@ Returns 0 on success and -1 on error.

put
---

Usage: `hadoop fs -put <localsrc> ... <dst> `
Usage: `hadoop fs -put [-f] [-p] [-l] [-d] [ - | <localsrc1> .. ]. <dst>`

Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system.
Copy single src, or multiple srcs from local file system to the destination file system.
Also reads input from stdin and writes to destination file system if the source is set to "-".

Copying fails if the file already exists, unless the `-f` flag is given.

Options:

* `-p` : Preserves access and modification times, ownership and the permissions.
  (assuming the permissions can be propagated across filesystems)
* `-f` : Overwrites the destination if it already exists.
* `-l` : Allow DataNode to lazily persist the file to disk. Forces a replication
  factor of 1. This flag will result in reduced durability. Use with care.
* `-d` : Skip creation of temporary file with the suffix `._COPYING_`.


Examples:

* `hadoop fs -put localfile /user/hadoop/hadoopfile`
* `hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdir`
* `hadoop fs -put localfile hdfs://nn.example.com/hadoop/hadoopfile`
* `hadoop fs -put -f localfile1 localfile2 /user/hadoop/hadoopdir`
* `hadoop fs -put -d localfile hdfs://nn.example.com/hadoop/hadoopfile`
* `hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile` Reads the input from stdin.

Exit Code:

@ -696,7 +724,7 @@ touchz

Usage: `hadoop fs -touchz URI [URI ...]`

Create a file of zero length.
Create a file of zero length. An error is returned if the file exists with non-zero length.

Example:

@ -729,3 +757,279 @@ usage

Usage: `hadoop fs -usage command`

Return the help for an individual command.


<a name="ObjectStores" />Working with Object Storage
====================================================

The Hadoop FileSystem shell works with Object Stores such as Amazon S3,
Azure WASB and OpenStack Swift.

```bash
# Create a directory
hadoop fs -mkdir s3a://bucket/datasets/

# Upload a file from the cluster filesystem
hadoop fs -put /datasets/example.orc s3a://bucket/datasets/

# touch a file
hadoop fs -touchz wasb://yourcontainer@youraccount.blob.core.windows.net/touched
```

Unlike a normal filesystem, renaming files and directories in an object store
usually takes time proportional to the size of the objects being manipulated.
As many of the filesystem shell operations
use renaming as their final stage, skipping that stage
can avoid long delays.

In particular, the `put` and `copyFromLocal` commands should
both have the `-d` option set for a direct upload.

```bash
# Upload a file from the cluster filesystem
hadoop fs -put -d /datasets/example.orc s3a://bucket/datasets/

# Upload a file from under the user's home directory in the local filesystem.
# Note it is the shell expanding the "~", not the hadoop fs command
hadoop fs -copyFromLocal -d -f ~/datasets/devices.orc s3a://bucket/datasets/

# create a file from stdin
# the special "-" source means "use stdin"
echo "hello" | hadoop fs -put -d -f - wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt
```

Objects can be downloaded and viewed:

```bash
# copy a directory to the local filesystem
hadoop fs -copyToLocal s3a://bucket/datasets/

# copy a file from the object store to the cluster filesystem.
hadoop fs -get wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt /examples

# print the object
hadoop fs -cat wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt

# print the object, unzipping it if necessary
hadoop fs -text wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt

# download log files into a local file
hadoop fs -getmerge wasb://yourcontainer@youraccount.blob.core.windows.net/logs\* log.txt
```

Commands which list many files tend to be significantly slower than when
working with HDFS or other filesystems.

```bash
hadoop fs -count s3a://bucket/
hadoop fs -du s3a://bucket/
```

Other slow commands include `find`, `mv`, `cp` and `rm`.

**Find**

This can be very slow on a large store with many directories under the path
supplied.

```bash
# enumerate all files in the object store's container.
hadoop fs -find s3a://bucket/ -print

# remember to escape the wildcards to stop the shell trying to expand them first
hadoop fs -find s3a://bucket/datasets/ -name \*.txt -print
```

**Rename**

The time to rename a file depends on its size.

The time to rename a directory depends on the number and size of all files
beneath that directory.

```bash
hadoop fs -mv s3a://bucket/datasets s3a://bucket/historical
```

If the operation is interrupted, the object store will be in an undefined
state.

**Copy**

```bash
hadoop fs -cp s3a://bucket/datasets s3a://bucket/historical
```

The copy operation reads each file and then writes it back to the object store;
the time to complete depends on the amount of data to copy, and the bandwidth
in both directions between the local computer and the object store.

**The further the computer is from the object store, the longer the copy takes**

Deleting objects
----------------

The `rm` command will delete objects and directories full of objects.
If the object store is *eventually consistent*, `fs ls` commands
and other accessors may briefly return the details of the now-deleted objects; this
is an artifact of object stores which cannot be avoided.

If the filesystem client is configured to copy files to a trash directory,
this will be in the bucket; the `rm` operation will then take time proportional
to the size of the data. Furthermore, the deleted files will continue to incur
storage costs.

To avoid this, use the `-skipTrash` option.

```bash
hadoop fs -rm -skipTrash s3a://bucket/dataset
```

Data moved to the `.Trash` directory can be purged using the `expunge` command.
As this command only works with the default filesystem, the default filesystem
must be set to the target object store.

```bash
hadoop fs -expunge -D fs.defaultFS=s3a://bucket/
```

Overwriting Objects
-------------------

If an object store is *eventually consistent*, then any operation which
overwrites existing objects may not be immediately visible to all clients/queries.
That is: later operations which query the same object's status or contents
may get the previous object. This can sometimes surface within the same client,
while reading a single object.

Avoid having a sequence of commands which overwrite objects and then immediately
work on the updated data; there is a risk that the previous data will be used
instead.
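
As a sketch of the pattern to avoid (the bucket and file names are placeholders, not from the original examples):

```bash
# overwrite an object, then read it straight back:
# on an eventually consistent store the read may still return the old data
hadoop fs -put -f -d updated.csv s3a://bucket/data/latest.csv
hadoop fs -cat s3a://bucket/data/latest.csv
```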

Timestamps
----------

Timestamps of objects and directories in Object Stores
may not follow the behavior of files and directories in HDFS.

1. The creation and initial modification times of an object will be the
   time it was created on the object store; this will be at the end of the write process,
   not the beginning.
1. The timestamp will be taken from the object store infrastructure's clock, not that of
   the client.
1. If an object is overwritten, the modification time will be updated.
1. Directories may or may not have valid timestamps. They are unlikely
   to have their modification times updated when an object underneath is updated.
1. The `atime` access time feature is not supported by any of the object stores
   found in the Apache Hadoop codebase.

Consult the `DistCp` documentation for details on how this may affect the
`distcp -update` operation.

Security model and operations
-----------------------------

The security and permissions models of object stores are usually very different
from those of a Unix-style filesystem; operations which query or manipulate
permissions are generally unsupported.

Operations to which this applies include: `chgrp`, `chmod`, `chown`,
`getfacl`, and `setfacl`. The related attribute commands `getfattr` and `setfattr`
are also usually unavailable.

* Filesystem commands which list permission and user/group details usually
simulate these details.

* Operations which try to preserve permissions (for example `fs -put -p`)
do not preserve permissions for this reason. (Special case: `wasb://`, which preserves
permissions but does not enforce them).

When interacting with read-only object stores, the permissions found in "list"
and "stat" commands may indicate that the user has write access, when in fact they do not.

Object stores usually have permissions models of their own;
these models can be manipulated through store-specific tooling.
Be aware that some of the permissions which an object store may provide
(such as write-only paths, or different permissions on the root path) may
be incompatible with the Hadoop filesystem clients. These tend to require full
read and write access to the entire object store bucket/container into which they write data.

As an example of how permissions are mocked, here is a listing of Amazon's public,
read-only bucket of Landsat images:

```bash
$ hadoop fs -ls s3a://landsat-pds/
Found 10 items
drwxrwxrwx   - mapred          0 2016-09-26 12:16 s3a://landsat-pds/L8
-rw-rw-rw-   1 mapred      23764 2015-01-28 18:13 s3a://landsat-pds/index.html
drwxrwxrwx   - mapred          0 2016-09-26 12:16 s3a://landsat-pds/landsat-pds_stats
-rw-rw-rw-   1 mapred        105 2016-08-19 18:12 s3a://landsat-pds/robots.txt
-rw-rw-rw-   1 mapred         38 2016-09-26 12:16 s3a://landsat-pds/run_info.json
drwxrwxrwx   - mapred          0 2016-09-26 12:16 s3a://landsat-pds/runs
-rw-rw-rw-   1 mapred   27458808 2016-09-26 12:16 s3a://landsat-pds/scene_list.gz
drwxrwxrwx   - mapred          0 2016-09-26 12:16 s3a://landsat-pds/tarq
drwxrwxrwx   - mapred          0 2016-09-26 12:16 s3a://landsat-pds/tarq_corrupt
drwxrwxrwx   - mapred          0 2016-09-26 12:16 s3a://landsat-pds/test
```

1. All files are listed as having full read/write permissions.
1. All directories appear to have full `rwx` permissions.
1. The replication count of all files is "1".
1. The owner of all files and directories is declared to be the current user (`mapred`).
1. The timestamp of all directories is actually that of the time the `-ls` operation
   was executed. This is because these directories are not actual objects in the store;
   they are simulated directories based on the existence of objects under their paths.

When an attempt is made to delete one of the files, the operation fails, despite
the permissions shown by the `ls` command:

```bash
$ hadoop fs -rm s3a://landsat-pds/scene_list.gz
rm: s3a://landsat-pds/scene_list.gz: delete on s3a://landsat-pds/scene_list.gz:
com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3;
Status Code: 403; Error Code: AccessDenied; Request ID: 1EF98D5957BCAB3D),
S3 Extended Request ID: wi3veOXFuFqWBUCJgV3Z+NQVj9gWgZVdXlPU4KBbYMsw/gA+hyhRXcaQ+PogOsDgHh31HlTCebQ=
```

This demonstrates that the listed permissions cannot
be taken as evidence of write access; only object manipulation can determine
this.

Note that the Microsoft Azure WASB filesystem does allow permissions to be set and checked;
however, the permissions are not actually enforced. This feature offers
the ability for an HDFS directory tree to be backed up with DistCp, with
its permissions preserved; those permissions may be restored when copying
the directory back into HDFS. For securing access to the data in the object
store, however, Azure's [own model and tools must be used](https://azure.microsoft.com/en-us/documentation/articles/storage-security-guide/).


Commands of limited value
-------------------------

Here is the list of shell commands which generally have no effect, and may
actually fail.

| command   | limitations |
|-----------|-------------|
| `appendToFile` | generally unsupported |
| `checksum` | the usual checksum is "NONE" |
| `chgrp` | generally unsupported permissions model; no-op |
| `chmod` | generally unsupported permissions model; no-op |
| `chown` | generally unsupported permissions model; no-op |
| `createSnapshot` | generally unsupported |
| `deleteSnapshot` | generally unsupported |
| `df` | default values are normally displayed |
| `getfacl` | may or may not be supported |
| `getfattr` | generally supported |
| `renameSnapshot` | generally unsupported |
| `setfacl` | generally unsupported permissions model |
| `setfattr` | generally unsupported permissions model |
| `setrep` | has no effect |
| `truncate` | generally unsupported |

Different object store clients *may* support these commands: do consult the
documentation and test against the target store.
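
As a small illustration of the `checksum` row above, querying a checksum against an S3A path typically just reports the lack of a usable value; the bucket and object names below are placeholders:

```bash
# on most object stores this reports a checksum of NONE rather than an MD5-of-CRC value
hadoop fs -checksum s3a://bucket/datasets/example.orc
```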

@ -30,10 +30,10 @@ are places where HDFS diverges from the expected behaviour of a POSIX

filesystem.

The behaviour of other Hadoop filesystems is not as rigorously tested.
The bundled S3 FileSystem makes Amazon's S3 Object Store ("blobstore")
The bundled S3N and S3A FileSystem clients make Amazon's S3 Object Store ("blobstore")
accessible through the FileSystem API. The Swift FileSystem driver provides similar
functionality for the OpenStack Swift blobstore. The Azure object storage
FileSystem in branch-1-win talks to Microsoft's Azure equivalent. All of these
FileSystem talks to Microsoft's Azure equivalent. All of these
bind to object stores, which do have different behaviors, especially regarding
consistency guarantees, and atomicity of operations.

@ -68,7 +68,7 @@ $H3 Basic Usage

This will expand the namespace under `/foo/bar` on nn1 into a temporary file,
partition its contents among a set of map tasks, and start a copy on each
NodeManager from nn1 to nn2.
NodeManager from `nn1` to `nn2`.

One can also specify multiple source directories on the command line:

@ -110,7 +110,7 @@ $H3 Basic Usage

It's also worth noting that if another client is still writing to a source
file, the copy will likely fail. Attempting to overwrite a file being written
at the destination should also fail on HDFS. If a source file is (re)moved
before it is copied, the copy will fail with a FileNotFoundException.
before it is copied, the copy will fail with a `FileNotFoundException`.

Please refer to the detailed Command Line Reference for information on all
the options available in DistCp.

@ -196,20 +196,20 @@ $H3 raw Namespace Extended Attribute Preservation

This section only applies to HDFS.

If the target and all of the source pathnames are in the /.reserved/raw
If the target and all of the source pathnames are in the `/.reserved/raw`
hierarchy, then 'raw' namespace extended attributes will be preserved.
'raw' xattrs are used by the system for internal functions such as encryption
meta data. They are only visible to users when accessed through the
/.reserved/raw hierarchy.
`/.reserved/raw` hierarchy.

raw xattrs are preserved based solely on whether /.reserved/raw prefixes are
supplied. The -p (preserve, see below) flag does not impact preservation of
raw xattrs.

To prevent raw xattrs from being preserved, simply do not use the
/.reserved/raw prefix on any of the source and target paths.
`/.reserved/raw` prefix on any of the source and target paths.

If the /.reserved/raw prefix is specified on only a subset of the source and
If the `/.reserved/raw` prefix is specified on only a subset of the source and
target paths, an error will be displayed and a non-0 exit code returned.

Command Line Options

@ -288,27 +288,27 @@ $H3 Copy-listing Generator

that need copy into a SequenceFile, for consumption by the DistCp Hadoop
Job. The main classes in this module include:

1. CopyListing: The interface that should be implemented by any
1. `CopyListing`: The interface that should be implemented by any
   copy-listing-generator implementation. Also provides the factory method by
   which the concrete CopyListing implementation is chosen.
2. SimpleCopyListing: An implementation of CopyListing that accepts multiple
2. `SimpleCopyListing`: An implementation of `CopyListing` that accepts multiple
   source paths (files/directories), and recursively lists all the individual
   files and directories under each, for copy.
3. GlobbedCopyListing: Another implementation of CopyListing that expands
3. `GlobbedCopyListing`: Another implementation of `CopyListing` that expands
   wild-cards in the source paths.
4. FileBasedCopyListing: An implementation of CopyListing that reads the
4. `FileBasedCopyListing`: An implementation of `CopyListing` that reads the
   source-path list from a specified file.

Based on whether a source-file-list is specified in the DistCpOptions, the
source-listing is generated in one of the following ways:

1. If there's no source-file-list, the GlobbedCopyListing is used. All
1. If there's no source-file-list, the `GlobbedCopyListing` is used. All
   wild-cards are expanded, and all the expansions are forwarded to the
   SimpleCopyListing, which in turn constructs the listing (via recursive
   descent of each path).
2. If a source-file-list is specified, the FileBasedCopyListing is used.
2. If a source-file-list is specified, the `FileBasedCopyListing` is used.
   Source-paths are read from the specified file, and then forwarded to the
   GlobbedCopyListing. The listing is then constructed as described above.
   `GlobbedCopyListing`. The listing is then constructed as described above.

One may customize the method by which the copy-listing is constructed by
providing a custom implementation of the CopyListing interface. The behaviour

@ -343,13 +343,13 @@ $H3 InputFormats and MapReduce Components

  implementation keeps the setup-time low.

* **DynamicInputFormat and DynamicRecordReader:**
  The DynamicInputFormat implements org.apache.hadoop.mapreduce.InputFormat,
  The DynamicInputFormat implements `org.apache.hadoop.mapreduce.InputFormat`,
  and is new to DistCp. The listing-file is split into several "chunk-files",
  the exact number of chunk-files being a multiple of the number of maps
  requested for in the Hadoop Job. Each map task is "assigned" one of the
  chunk-files (by renaming the chunk to the task's id), before the Job is
  launched.
  Paths are read from each chunk using the DynamicRecordReader, and
  Paths are read from each chunk using the `DynamicRecordReader`, and
  processed in the CopyMapper. After all the paths in a chunk are processed,
  the current chunk is deleted and a new chunk is acquired. The process
  continues until no more chunks are available.

@ -403,7 +403,7 @@ $H3 Map sizing

slower nodes. While this distribution isn't uniform, it is fair with regard
to each mapper's capacity.

The dynamic-strategy is implemented by the DynamicInputFormat. It provides
The dynamic-strategy is implemented by the `DynamicInputFormat`. It provides
superior performance under most conditions.

Tuning the number of maps to the size of the source and destination clusters,

@ -432,6 +432,105 @@ $H3 MapReduce and other side-effects

* If `mapreduce.map.speculative` is set final and true, the result of the
  copy is undefined.

$H3 DistCp and Object Stores

DistCp works with Object Stores such as Amazon S3, Azure WASB and OpenStack Swift.

Prerequisites

1. The JAR containing the object store implementation is on the classpath,
   along with all of its dependencies.
1. Unless the JAR automatically registers its bundled filesystem clients,
   the configuration may need to be modified to state the class which
   implements the filesystem schema. All of the ASF's own object store clients
   are self-registering.
1. The relevant object store access credentials must be available in the cluster
   configuration, or be otherwise available in all cluster hosts.
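
As an illustrative sketch of the last point, credentials can also be passed to a single run through generic `-D` options; the property names below assume the S3A connector, and the key values and bucket are placeholders. Keeping secrets in the cluster configuration or a credential provider is preferable to placing them on the command line.

```bash
# supply S3A credentials for one invocation only (values are placeholders)
hadoop distcp \
  -D fs.s3a.access.key=ACCESS_KEY \
  -D fs.s3a.secret.key=SECRET_KEY \
  hdfs://nn1:8020/datasets/set1 s3a://bucket/datasets/set1
```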

DistCp can be used to upload data

```bash
hadoop distcp hdfs://nn1:8020/datasets/set1 s3a://bucket/datasets/set1
```

To download data

```bash
hadoop distcp s3a://bucket/generated/results hdfs://nn1:8020/results
```

To copy data between object stores

```bash
hadoop distcp s3a://bucket/generated/results \
  wasb://updates@example.blob.core.windows.net
```

And to copy data within an object store

```bash
hadoop distcp wasb://updates@example.blob.core.windows.net/current \
  wasb://updates@example.blob.core.windows.net/old
```

And to use `-update` to only copy changed files.

```bash
hadoop distcp -update -numListstatusThreads 20 \
  swift://history.cluster1/2016 \
  hdfs://nn1:8020/history/2016
```

Because object stores are slow to list files, consider setting the `-numListstatusThreads` option when performing a `-update` operation
on a large directory tree (the limit is 40 threads).

When `DistCp -update` is used with object stores,
generally only the modification time and length of the individual files are compared,
not any checksums. The fact that most object stores do have valid timestamps
for directories is irrelevant; only the file timestamps are compared.
However, it is important to have the clock of the client computers close
to that of the infrastructure, so that timestamps are consistent between
the client/HDFS cluster and that of the object store. Otherwise, changed files may be
missed/copied too often.

**Notes**

* The `-atomic` option causes a rename of the temporary data, so significantly
  increases the time to commit work at the end of the operation. Furthermore,
  as Object Stores other than (optionally) `wasb://` do not offer atomic renames of directories,
  the `-atomic` operation doesn't actually deliver what is promised. *Avoid*.

* The `-append` option is not supported.

* The `-diff` and `-rdiff` options are not supported.

* CRC checking will not be performed, irrespective of the value of the `-skipCrc`
  flag.

* All `-p` options, including those to preserve permissions, user and group information,
  attributes, checksums and replication are generally ignored. The `wasb://` connector will
  preserve the information, but not enforce the permissions.

* Some object store connectors offer an option for in-memory buffering of
  output (for example, the S3A connector). Using such an option while copying
  large files may trigger some form of out of memory event,
  be it a heap overflow or a YARN container termination.
  This is particularly common if the network bandwidth
  between the cluster and the object store is limited (such as when working
  with remote object stores). It is best to disable/avoid such options and
  rely on disk buffering.

* Copy operations within a single object store still take place in the Hadoop cluster,
  even when the object store implements a more efficient COPY operation internally.

  That is, an operation such as

      hadoop distcp s3a://bucket/datasets/set1 s3a://bucket/datasets/set2

  copies each byte down to the Hadoop worker nodes and back to the
  bucket. As well as being slow, it means that charges may be incurred.


Frequently Asked Questions
--------------------------