HDFS-7668. Convert site documentation from apt to markdown (Masatake Iwasaki via aw)

This commit is contained in:
Allen Wittenauer 2015-02-12 18:19:45 -08:00
parent 93b941c637
commit 2f1e5dc628
45 changed files with 7260 additions and 9907 deletions

View File

@ -141,6 +141,9 @@ Trunk (Unreleased)
HDFS-7322. deprecate sbin/hadoop-daemon.sh (aw)
HDFS-7668. Convert site documentation from apt to markdown (Masatake
Iwasaki via aw)
OPTIMIZATIONS
BUG FIXES

View File

@ -1,233 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Archival Storage, SSD & Memory
---
---
${maven.build.timestamp}
Archival Storage, SSD & Memory
%{toc|section=1|fromDepth=0}
* {Introduction}
<Archival Storage> is a solution to decouple growing storage capacity from compute capacity.
Nodes with higher density and less expensive storage with low compute power are becoming available
and can be used as cold storage in the clusters.
Based on policy, data can be moved from the hot storage to the cold storage.
Adding more nodes to the cold storage can grow the storage independent of the compute capacity
in the cluster.
The frameworks provided by Heterogeneous Storage and Archival Storage generalize the HDFS architecture
to include other kinds of storage media, including <SSD> and <memory>.
Users may choose to store their data in SSD or memory for better performance.
* {Storage Types and Storage Policies}
** {Storage Types: ARCHIVE, DISK, SSD and RAM_DISK}
The first phase of
{{{https://issues.apache.org/jira/browse/HDFS-2832}Heterogeneous Storage (HDFS-2832)}}
changed datanode storage model from a single storage,
which may correspond to multiple physical storage medias,
to a collection of storages with each storage corresponding to a physical storage media.
It also added the notion of storage types, DISK and SSD,
where DISK is the default storage type.
A new storage type <ARCHIVE>,
which has high storage density (petabyte of storage) but little compute power,
is added for supporting archival storage.
Another new storage type <RAM_DISK> is added for supporting writing single replica files in memory.
** {Storage Policies: Hot, Warm, Cold, All_SSD, One_SSD and Lazy_Persist}
A new concept of storage policies is introduced in order to allow files to be stored
in different storage types according to the storage policy.
We have the following storage policies:
* <<Hot>> - for both storage and compute.
The data that is popular and still being used for processing will stay in this policy.
When a block is hot, all replicas are stored in DISK.
* <<Cold>> - only for storage with limited compute.
The data that is no longer being used, or data that needs to be archived is moved
from hot storage to cold storage.
When a block is cold, all replicas are stored in ARCHIVE.
* <<Warm>> - partially hot and partially cold.
When a block is warm, some of its replicas are stored in DISK
and the remaining replicas are stored in ARCHIVE.
* <<All_SSD>> - for storing all replicas in SSD.
* <<One_SSD>> - for storing one of the replicas in SSD.
The remaining replicas are stored in DISK.
* <<Lazy_Persist>> - for writing blocks with single replica in memory.
The replica is first written in RAM_DISK and then it is lazily persisted in DISK.
[]
More formally, a storage policy consists of the following fields:
[[1]] Policy ID
[[2]] Policy name
[[3]] A list of storage types for block placement
[[4]] A list of fallback storage types for file creation
[[5]] A list of fallback storage types for replication
[]
When there is enough space,
block replicas are stored according to the storage type list specified in #3.
When some of the storage types in list #3 are running out of space,
the fallback storage type lists specified in #4 and #5 are used
to replace the out-of-space storage types for file creation and replication, respectively.
The following is a typical storage policy table.
*--------+---------------+--------------------------+-----------------------+-----------------------+
| <<Policy>> | <<Policy>>| <<Block Placement>> | <<Fallback storages>> | <<Fallback storages>> |
| <<ID>> | <<Name>> | <<(n\ replicas)>> | <<for creation>> | <<for replication>> |
*--------+---------------+--------------------------+-----------------------+-----------------------+
| 15 | Lazy_Persist | RAM_DISK: 1, DISK: <n>-1 | DISK | DISK |
*--------+---------------+--------------------------+-----------------------+-----------------------+
| 12 | All_SSD | SSD: <n> | DISK | DISK |
*--------+---------------+--------------------------+-----------------------+-----------------------+
| 10 | One_SSD | SSD: 1, DISK: <n>-1 | SSD, DISK | SSD, DISK |
*--------+---------------+--------------------------+-----------------------+-----------------------+
| 7 | Hot (default) | DISK: <n> | \<none\> | ARCHIVE |
*--------+---------------+--------------------------+-----------------------+-----------------------+
| 5 | Warm | DISK: 1, ARCHIVE: <n>-1 | ARCHIVE, DISK | ARCHIVE, DISK |
*--------+---------------+--------------------------+-----------------------+-----------------------+
| 2 | Cold | ARCHIVE: <n> | \<none\> | \<none\> |
*--------+---------------+--------------------------+-----------------------+-----------------------+
Note that the Lazy_Persist policy is useful only for single replica blocks.
For blocks with more than one replica, all the replicas will be written to DISK
since writing only one of the replicas to RAM_DISK does not improve the overall performance.
** {Storage Policy Resolution}
When a file or directory is created, its storage policy is <unspecified>.
The storage policy can be specified using
the "<<<{{{Set Storage Policy}dfsadmin -setStoragePolicy}}>>>" command.
The effective storage policy of a file or directory is resolved by the following rules.
[[1]] If the file or directory is specified with a storage policy, return it.
[[2]] For an unspecified file or directory,
if it is the root directory, return the <default storage policy>.
Otherwise, return its parent's effective storage policy.
[]
The effective storage policy can be retrieved by
the "<<<{{{Get Storage Policy}storagepolicies -getStoragePolicy}}>>>" command.
** {Configuration}
* <<dfs.storage.policy.enabled>>
- for enabling/disabling the storage policy feature.
The default value is <<<true>>>.
[]
* {Mover - A New Data Migration Tool}
A new data migration tool is added for archiving data.
The tool is similar to Balancer.
It periodically scans the files in HDFS to check if the block placement satisfies the storage policy.
For the blocks violating the storage policy,
it moves the replicas to a different storage type
in order to fulfill the storage policy requirement.
* Command:
+------------------------------------------+
hdfs mover [-p <files/dirs> | -f <local file name>]
+------------------------------------------+
* Arguments:
*-------------------------+--------------------------------------------------------+
| <<<-p \<files/dirs\>>>> | Specify a space separated list of HDFS files/dirs to migrate.
*-------------------------+--------------------------------------------------------+
| <<<-f \<local file\>>>> | Specify a local file containing a list of HDFS files/dirs to migrate.
*-------------------------+--------------------------------------------------------+
Note that, when both -p and -f options are omitted, the default path is the root directory.
[]
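For example, assuming a hypothetical directory <<</data/cold>>> whose storage policy
has been changed, its blocks could be migrated with:
+------------------------------------------+
# /data/cold is an illustrative path; substitute your own files/dirs
hdfs mover -p /data/cold
+------------------------------------------+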
* {Storage Policy Commands}
** {List Storage Policies}
List out all the storage policies.
* Command:
+------------------------------------------+
hdfs storagepolicies -listPolicies
+------------------------------------------+
* Arguments: none.
** {Set Storage Policy}
Set a storage policy to a file or a directory.
* Command:
+------------------------------------------+
hdfs storagepolicies -setStoragePolicy -path <path> -policy <policy>
+------------------------------------------+
* Arguments:
*--------------------------+-----------------------------------------------------+
| <<<-path \<path\>>>> | The path referring to either a directory or a file. |
*--------------------------+-----------------------------------------------------+
| <<<-policy \<policy\>>>> | The name of the storage policy. |
*--------------------------+-----------------------------------------------------+
[]
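For example, to assign the Cold policy to a hypothetical directory <<</data/cold>>>:
+------------------------------------------+
# /data/cold is an illustrative path
hdfs storagepolicies -setStoragePolicy -path /data/cold -policy COLD
+------------------------------------------+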
** {Get Storage Policy}
Get the storage policy of a file or a directory.
* Command:
+------------------------------------------+
hdfs storagepolicies -getStoragePolicy -path <path>
+------------------------------------------+
* Arguments:
*----------------------------+-----------------------------------------------------+
| <<<-path \<path\>>>> | The path referring to either a directory or a file. |
*----------------------------+-----------------------------------------------------+
[]
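Continuing the example above, the effective policy of the hypothetical
<<</data/cold>>> directory can then be checked with:
+------------------------------------------+
hdfs storagepolicies -getStoragePolicy -path /data/cold
+------------------------------------------+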

View File

@ -1,344 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Distributed File System-${project.version} - Centralized Cache Management in HDFS
---
---
${maven.build.timestamp}
Centralized Cache Management in HDFS
%{toc|section=1|fromDepth=2|toDepth=4}
* {Overview}
<Centralized cache management> in HDFS is an explicit caching mechanism that
allows users to specify <paths> to be cached by HDFS. The NameNode will
communicate with DataNodes that have the desired blocks on disk, and instruct
them to cache the blocks in off-heap caches.
Centralized cache management in HDFS has many significant advantages.
[[1]] Explicit pinning prevents frequently used data from being evicted from
memory. This is particularly important when the size of the working set
exceeds the size of main memory, which is common for many HDFS workloads.
[[1]] Because DataNode caches are managed by the NameNode, applications can
query the set of cached block locations when making task placement decisions.
Co-locating a task with a cached block replica improves read performance.
[[1]] When a block has been cached by a DataNode, clients can use a new,
more-efficient, zero-copy read API. Since checksum verification of cached
data is done once by the DataNode, clients can incur essentially zero
overhead when using this new API.
[[1]] Centralized caching can improve overall cluster memory utilization.
When relying on the OS buffer cache at each DataNode, repeated reads of
a block will result in all <n> replicas of the block being pulled into
buffer cache. With centralized cache management, a user can explicitly pin
only <m> of the <n> replicas, saving <n-m> memory.
* {Use Cases}
Centralized cache management is useful for files that are accessed repeatedly.
For example, a small <fact table> in Hive which is often used for joins is a
good candidate for caching. On the other hand, caching the input of a
<one year reporting query> is probably less useful, since the
historical data might only be read once.
Centralized cache management is also useful for mixed workloads with
performance SLAs. Caching the working set of a high-priority workload
ensures that it does not contend for disk I/O with a low-priority workload.
* {Architecture}
[images/caching.png] Caching Architecture
In this architecture, the NameNode is responsible for coordinating all the
DataNode off-heap caches in the cluster. The NameNode periodically receives
a <cache report> from each DataNode which describes all the blocks cached
on a given DN. The NameNode manages DataNode caches by piggybacking cache and
uncache commands on the DataNode heartbeat.
The NameNode queries its set of <cache directives> to determine
which paths should be cached. Cache directives are persistently stored in the
fsimage and edit log, and can be added, removed, and modified via Java and
command-line APIs. The NameNode also stores a set of <cache pools>,
which are administrative entities used to group cache directives together for
resource management and enforcing permissions.
The NameNode periodically rescans the namespace and active cache directives
to determine which blocks need to be cached or uncached and assign caching
work to DataNodes. Rescans can also be triggered by user actions like adding
or removing a cache directive or removing a cache pool.
We do not currently cache blocks which are under construction, corrupt, or
otherwise incomplete. If a cache directive covers a symlink, the symlink
target is not cached.
Caching is currently done at the file or directory level. Block and sub-block
caching is an item of future work.
* {Concepts}
** {Cache directive}
A <cache directive> defines a path that should be cached. Paths can be either
directories or files. Directories are cached non-recursively, meaning only
files in the first-level listing of the directory.
Directives also specify additional parameters, such as the cache replication
factor and expiration time. The replication factor specifies the number of
block replicas to cache. If multiple cache directives refer to the same file,
the maximum cache replication factor is applied.
The expiration time is specified on the command line as a <time-to-live
(TTL)>, a relative expiration time in the future. After a cache directive
expires, it is no longer considered by the NameNode when making caching
decisions.
** {Cache pool}
A <cache pool> is an administrative entity used to manage groups of cache
directives. Cache pools have UNIX-like <permissions>, which restrict which
users and groups have access to the pool. Write permissions allow users to
add and remove cache directives to the pool. Read permissions allow users to
list the cache directives in a pool, as well as additional metadata. Execute
permissions are unused.
Cache pools are also used for resource management. Pools can enforce a
maximum <limit>, which restricts the number of bytes that can be cached in
aggregate by directives in the pool. Normally, the sum of the pool limits
will approximately equal the amount of aggregate memory reserved for
HDFS caching on the cluster. Cache pools also track a number of statistics
to help cluster users determine what is and should be cached.
Pools also can enforce a maximum time-to-live. This restricts the maximum
expiration time of directives being added to the pool.
* {<<<cacheadmin>>> command-line interface}
On the command-line, administrators and users can interact with cache pools
and directives via the <<<hdfs cacheadmin>>> subcommand.
Cache directives are identified by a unique, non-repeating 64-bit integer ID.
IDs will not be reused even if a cache directive is later removed.
Cache pools are identified by a unique string name.
** {Cache directive commands}
*** {addDirective}
Usage: <<<hdfs cacheadmin -addDirective -path <path> -pool <pool-name> [-force] [-replication <replication>] [-ttl <time-to-live>]>>>
Add a new cache directive.
*--+--+
\<path\> | A path to cache. The path can be a directory or a file.
*--+--+
\<pool-name\> | The pool to which the directive will be added. You must have write permission on the cache pool in order to add new directives.
*--+--+
-force | Skips checking of cache pool resource limits.
*--+--+
\<replication\> | The cache replication factor to use. Defaults to 1.
*--+--+
\<time-to-live\> | How long the directive is valid. Can be specified in minutes, hours, and days, e.g. 30m, 4h, 2d. Valid units are [smhd]. "never" indicates a directive that never expires. If unspecified, the directive never expires.
*--+--+
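As a sketch, the following caches two replicas of a hypothetical fact table for
one week (the path and pool name are illustrative, and the pool must already exist):
----
# path and pool name are examples only
hdfs cacheadmin -addDirective -path /user/hive/warehouse/fact -pool sales -replication 2 -ttl 7d
----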
*** {removeDirective}
Usage: <<<hdfs cacheadmin -removeDirective <id> >>>
Remove a cache directive.
*--+--+
\<id\> | The id of the cache directive to remove. You must have write permission on the pool of the directive in order to remove it. To see a list of cache directive IDs, use the -listDirectives command.
*--+--+
*** {removeDirectives}
Usage: <<<hdfs cacheadmin -removeDirectives <path> >>>
Remove every cache directive with the specified path.
*--+--+
\<path\> | The path of the cache directives to remove. You must have write permission on the pool of the directive in order to remove it. To see a list of cache directives, use the -listDirectives command.
*--+--+
*** {listDirectives}
Usage: <<<hdfs cacheadmin -listDirectives [-stats] [-path <path>] [-pool <pool>]>>>
List cache directives.
*--+--+
\<path\> | List only cache directives with this path. Note that if there is a cache directive for <path> in a cache pool that we don't have read access for, it will not be listed.
*--+--+
\<pool\> | List only path cache directives in that pool.
*--+--+
-stats | List path-based cache directive statistics.
*--+--+
** {Cache pool commands}
*** {addPool}
Usage: <<<hdfs cacheadmin -addPool <name> [-owner <owner>] [-group <group>] [-mode <mode>] [-limit <limit>] [-maxTtl <maxTtl>]>>>
Add a new cache pool.
*--+--+
\<name\> | Name of the new pool.
*--+--+
\<owner\> | Username of the owner of the pool. Defaults to the current user.
*--+--+
\<group\> | Group of the pool. Defaults to the primary group name of the current user.
*--+--+
\<mode\> | UNIX-style permissions for the pool. Permissions are specified in octal, e.g. 0755. By default, this is set to 0755.
*--+--+
\<limit\> | The maximum number of bytes that can be cached by directives in this pool, in aggregate. By default, no limit is set.
*--+--+
\<maxTtl\> | The maximum allowed time-to-live for directives being added to the pool. This can be specified in seconds, minutes, hours, and days, e.g. 120s, 30m, 4h, 2d. Valid units are [smhd]. By default, no maximum is set. A value of \"never\" specifies that there is no limit.
*--+--+
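For example, a pool like the one used in the addDirective sketch above could be
created as follows (the pool name, owner, group, limit and TTL values are all illustrative):
----
# 10737418240 bytes = 10 GB aggregate cache limit
hdfs cacheadmin -addPool sales -owner hive -group hive -mode 0750 -limit 10737418240 -maxTtl 30d
----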
*** {modifyPool}
Usage: <<<hdfs cacheadmin -modifyPool <name> [-owner <owner>] [-group <group>] [-mode <mode>] [-limit <limit>] [-maxTtl <maxTtl>]>>>
Modifies the metadata of an existing cache pool.
*--+--+
\<name\> | Name of the pool to modify.
*--+--+
\<owner\> | Username of the owner of the pool.
*--+--+
\<group\> | Groupname of the group of the pool.
*--+--+
\<mode\> | Unix-style permissions of the pool in octal.
*--+--+
\<limit\> | Maximum number of bytes that can be cached by this pool.
*--+--+
\<maxTtl\> | The maximum allowed time-to-live for directives being added to the pool.
*--+--+
*** {removePool}
Usage: <<<hdfs cacheadmin -removePool <name> >>>
Remove a cache pool. This also uncaches paths associated with the pool.
*--+--+
\<name\> | Name of the cache pool to remove.
*--+--+
*** {listPools}
Usage: <<<hdfs cacheadmin -listPools [-stats] [<name>]>>>
Display information about one or more cache pools, e.g. name, owner, group,
permissions, etc.
*--+--+
-stats | Display additional cache pool statistics.
*--+--+
\<name\> | If specified, list only the named cache pool.
*--+--+
*** {help}
Usage: <<<hdfs cacheadmin -help <command-name> >>>
Get detailed help about a command.
*--+--+
\<command-name\> | The command for which to get detailed help. If no command is specified, print detailed help for all commands.
*--+--+
* {Configuration}
** {Native Libraries}
In order to lock block files into memory, the DataNode relies on native JNI
code found in <<<libhadoop.so>>> or <<<hadoop.dll>>> on Windows. Be sure to
{{{../hadoop-common/NativeLibraries.html}enable JNI}} if you are using HDFS
centralized cache management.
** {Configuration Properties}
*** Required
Be sure to configure the following:
* dfs.datanode.max.locked.memory
This determines the maximum amount of memory a DataNode will use for caching.
On Unix-like systems, the "locked-in-memory size" ulimit (<<<ulimit -l>>>) of
the DataNode user also needs to be increased to match this parameter (see
below section on {{OS Limits}}). When setting this value, please remember
that you will need space in memory for other things as well, such as the
DataNode and application JVM heaps and the operating system page cache.
*** Optional
The following properties are not required, but may be specified for tuning:
* dfs.namenode.path.based.cache.refresh.interval.ms
The NameNode will use this as the number of milliseconds between subsequent
path cache rescans. Each rescan calculates the blocks to cache and, for each
block, a DataNode holding a replica that should cache it.
By default, this parameter is set to 300000, which is five minutes.
* dfs.datanode.fsdatasetcache.max.threads.per.volume
The DataNode will use this as the maximum number of threads per volume to
use for caching new data.
By default, this parameter is set to 4.
* dfs.cachereport.intervalMsec
The DataNode will use this as the number of milliseconds between sending a
full report of its cache state to the NameNode.
By default, this parameter is set to 10000, which is 10 seconds.
* dfs.namenode.path.based.cache.block.map.allocation.percent
The percentage of the Java heap which we will allocate to the cached blocks
map. The cached blocks map is a hash map which uses chained hashing.
Smaller maps may be accessed more slowly if the number of cached blocks is
large; larger maps will consume more memory. The default is 0.25 percent.
** {OS Limits}
If you get the error "Cannot start datanode because the configured max
locked memory size... is more than the datanode's available RLIMIT_MEMLOCK
ulimit," that means that the operating system is imposing a lower limit
on the amount of memory that you can lock than what you have configured. To
fix this, you must adjust the ulimit -l value that the DataNode runs with.
Usually, this value is configured in <<</etc/security/limits.conf>>>.
However, it will vary depending on what operating system and distribution
you are using.
You will know that you have correctly configured this value when you can run
<<<ulimit -l>>> from the shell and get back either a higher value than what
you have configured with <<<dfs.datanode.max.locked.memory>>>, or the string
"unlimited," indicating that there is no limit. Note that it's typical for
<<<ulimit -l>>> to output the memory lock limit in KB, but
dfs.datanode.max.locked.memory must be specified in bytes.
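As a quick sanity check, something like the following can be run (the <<<hdfs>>>
user name and the reported value are illustrative):
----
# run as the user that starts the DataNode
sudo -u hdfs sh -c 'ulimit -l'
# if the reported value (usually KB) * 1024 is smaller than
# dfs.datanode.max.locked.memory, raise the memlock limit,
# e.g. in /etc/security/limits.conf
----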
This information does not apply to deployments on Windows. Windows has no
direct equivalent of <<<ulimit -l>>>.

View File

@ -1,97 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Distributed File System-${project.version} - Extended Attributes
---
---
${maven.build.timestamp}
Extended Attributes in HDFS
%{toc|section=1|fromDepth=2|toDepth=4}
* {Overview}
<Extended attributes> (abbreviated as <xattrs>) are a filesystem feature that allow user applications to associate additional metadata with a file or directory. Unlike system-level inode metadata such as file permissions or modification time, extended attributes are not interpreted by the system and are instead used by applications to store additional information about an inode. Extended attributes could be used, for instance, to specify the character encoding of a plain-text document.
** {HDFS extended attributes}
Extended attributes in HDFS are modeled after extended attributes in Linux (see the Linux manpage for {{{http://www.bestbits.at/acl/man/man5/attr.txt}attr(5)}} and {{{http://www.bestbits.at/acl/}related documentation}}). An extended attribute is a <name-value pair>, with a string name and binary value. Xattr names must also be prefixed with a <namespace>. For example, an xattr named <myXattr> in the <user> namespace would be specified as <<user.myXattr>>. Multiple xattrs can be associated with a single inode.
** {Namespaces and Permissions}
In HDFS, there are five valid namespaces: <<<user>>>, <<<trusted>>>, <<<system>>>, <<<security>>>, and <<<raw>>>. Each of these namespaces has different access restrictions.
The <<<user>>> namespace is the namespace that will commonly be used by client applications. Access to extended attributes in the user namespace is controlled by the corresponding file permissions.
The <<<trusted>>> namespace is available only to HDFS superusers.
The <<<system>>> namespace is reserved for internal HDFS use. This namespace is not accessible through userspace methods, and is reserved for implementing internal HDFS features.
The <<<security>>> namespace is reserved for internal HDFS use. This namespace is generally not accessible through userspace methods. One particular use of <<<security>>> is the <<<security.hdfs.unreadable.by.superuser>>> extended attribute. This xattr can only be set on files, and it will prevent the superuser from reading the file's contents. The superuser can still read and modify file metadata, such as the owner, permissions, etc. This xattr can be set and accessed by any user, assuming normal filesystem permissions. This xattr is also write-once, and cannot be removed once set. This xattr does not allow a value to be set.
The <<<raw>>> namespace is reserved for internal system attributes that sometimes need to be exposed. Like <<<system>>> namespace attributes they are not visible to the user except when <<<getXAttr>>>/<<<getXAttrs>>> is called on a file or directory in the <<</.reserved/raw>>> HDFS directory hierarchy. These attributes can only be accessed by the superuser. An example of where <<<raw>>> namespace extended attributes are used is the <<<distcp>>> utility. Encryption zone meta data is stored in <<<raw.*>>> extended attributes, so as long as the administrator uses <<</.reserved/raw>>> pathnames in source and target, the encrypted files in the encryption zones are transparently copied.
* {Interacting with extended attributes}
The Hadoop shell has support for interacting with extended attributes via <<<hadoop fs -getfattr>>> and <<<hadoop fs -setfattr>>>. These commands are styled after the Linux {{{http://www.bestbits.at/acl/man/man1/getfattr.txt}getfattr(1)}} and {{{http://www.bestbits.at/acl/man/man1/setfattr.txt}setfattr(1)}} commands.
** {getfattr}
<<<hadoop fs -getfattr [-R] {-n name | -d} [-e en] <path>>>>
Displays the extended attribute names and values (if any) for a file or directory.
*--+--+
-R | Recursively list the attributes for all files and directories.
*--+--+
-n name | Dump the named extended attribute value.
*--+--+
-d | Dump all extended attribute values associated with pathname.
*--+--+
-e \<encoding\> | Encode values after retrieving them. Valid encodings are "text", "hex", and "base64". Values encoded as text strings are enclosed in double quotes ("), and values encoded as hexadecimal and base64 are prefixed with 0x and 0s, respectively.
*--+--+
\<path\> | The file or directory.
*--+--+
** {setfattr}
<<<hadoop fs -setfattr {-n name [-v value] | -x name} <path>>>>
Sets an extended attribute name and value for a file or directory.
*--+--+
-n name | The extended attribute name.
*--+--+
-v value | The extended attribute value. There are three different encoding methods for the value. If the argument is enclosed in double quotes, then the value is the string inside the quotes. If the argument is prefixed with 0x or 0X, then it is taken as a hexadecimal number. If the argument begins with 0s or 0S, then it is taken as a base64 encoding.
*--+--+
-x name | Remove the extended attribute.
*--+--+
\<path\> | The file or directory.
*--+--+
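For example, the <<<user.myXattr>>> attribute mentioned earlier could be set and
read back as follows (the path is illustrative):
----
hadoop fs -setfattr -n user.myXattr -v "UTF-8" /user/alice/notes.txt
hadoop fs -getfattr -d /user/alice/notes.txt
----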
* {Configuration options}
HDFS supports extended attributes out of the box, without additional configuration. Administrators could potentially be interested in the options limiting the number of xattrs per inode and the size of xattrs, since xattrs increase the on-disk and in-memory space consumption of an inode.
* <<<dfs.namenode.xattrs.enabled>>>
Whether support for extended attributes is enabled on the NameNode. By default, extended attributes are enabled.
* <<<dfs.namenode.fs-limits.max-xattrs-per-inode>>>
The maximum number of extended attributes per inode. By default, this limit is 32.
* <<<dfs.namenode.fs-limits.max-xattr-size>>>
The maximum combined size of the name and value of an extended attribute in bytes. By default, this limit is 16384 bytes.

View File

@ -1,312 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Fault Injection Framework and Development Guide
---
---
${maven.build.timestamp}
Fault Injection Framework and Development Guide
%{toc|section=1|fromDepth=0}
* Introduction
This guide provides an overview of the Hadoop Fault Injection (FI)
framework for those who will be developing their own faults (aspects).
The idea of fault injection is fairly simple: it is an infusion of
errors and exceptions into an application's logic to achieve higher
coverage and fault tolerance of the system. Different implementations
of this idea are available today. Hadoop's FI framework is built on top
of Aspect-Oriented Programming (AOP) as implemented by the AspectJ toolkit.
* Assumptions
The current implementation of the FI framework assumes that the faults
it will be emulating are non-deterministic in nature. That is, the
moment at which a fault happens isn't known in advance and is decided by
a coin flip.
* Architecture of the Fault Injection Framework
Components layout
** Configuration Management
This piece of the FI framework allows you to set expectations for
faults to happen. The settings can be applied either statically (in
advance) or in runtime. The desired level of faults in the framework
can be configured two ways:
* editing the src/aop/fi-site.xml configuration file. This file is
similar to other Hadoop configuration files
* setting JVM system properties through VM startup parameters or
in the build.properties file
** Probability Model
This is fundamentally a coin flipper. The methods of this class get a
random number between 0.0 and 1.0 and then check whether it falls between
0.0 and the level configured for the fault in question. If that condition
is true then the fault will occur.
Thus, to guarantee that a fault happens, set its level to 1.0.
To completely prevent a fault from happening, set its probability level to 0.0.
Note: The default probability level is set to 0 (zero) unless the level
is changed explicitly through the configuration file or at runtime.
The configuration parameter for the default level is fi.*
** Fault Injection Mechanism: AOP and AspectJ
The foundation of Hadoop's FI framework includes a cross-cutting
concept implemented by AspectJ. The following basic terms are important
to remember:
* A cross-cutting concept (aspect) is behavior, and often data, that
is used across the scope of a piece of software
* In AOP, aspects provide a mechanism by which a cross-cutting
concern can be specified in a modular way
* Advice is the code that is executed when an aspect is invoked
* A join point (or pointcut) is a specific point within the application
that may or may not invoke some advice
** Existing Join Points
The following readily available join points are provided by AspectJ:
* Join when a method is called
* Join during a method's execution
* Join when a constructor is invoked
* Join during a constructor's execution
* Join during aspect advice execution
* Join before an object is initialized
* Join during object initialization
* Join during static initializer execution
* Join when a class's field is referenced
* Join when a class's field is assigned
* Join when a handler is executed
* Aspect Example
----
package org.apache.hadoop.hdfs.server.datanode;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fi.ProbabilityModel;
import org.apache.hadoop.hdfs.server.datanode.DataNode;
import org.apache.hadoop.util.DiskChecker.*;
import java.io.IOException;
import java.io.OutputStream;
import java.io.DataOutputStream;
/**
* This aspect takes care of faults injected into the datanode.BlockReceiver
* class
*/
public aspect BlockReceiverAspects {
public static final Log LOG = LogFactory.getLog(BlockReceiverAspects.class);
public static final String BLOCK_RECEIVER_FAULT="hdfs.datanode.BlockReceiver";
pointcut callReceivePacket() : call (* OutputStream.write(..))
&& withincode (* BlockReceiver.receivePacket(..))
// to further limit the application of this aspect a very narrow 'target' can be used as follows
// && target(DataOutputStream)
&& !within(BlockReceiverAspects +);
before () throws IOException : callReceivePacket () {
if (ProbabilityModel.injectCriteria(BLOCK_RECEIVER_FAULT)) {
LOG.info("Before the injection point");
Thread.dumpStack();
throw new DiskOutOfSpaceException("FI: injected fault point at " +
thisJoinPoint.getStaticPart().getSourceLocation());
}
}
}
----
The aspect has two main parts:
* The join point pointcut callReceivePacket(), which serves as an
identification mark of a specific point (in control and/or data
flow) in the life of an application.
* A call to the advice - before () throws IOException :
callReceivePacket() - will be injected (see Putting It All
Together) before that specific spot of the application's code.
The pointcut identifies an invocation of the java.io.OutputStream class's
write() method with any number of parameters and any return type. This
invocation should take place within the body of the receivePacket() method
of the BlockReceiver class. The method can have any parameters and any return
type. Possible invocations of the write() method happening anywhere within
the aspect BlockReceiverAspects or its descendants will be ignored.
Note 1: This short example doesn't illustrate the fact that you can
have more than a single injection point per class. In such a case the
names of the faults have to be different if a developer wants to
trigger them separately.
Note 2: After the injection step (see Putting It All Together) you can
verify that the faults were properly injected by searching for ajc
keywords in a disassembled class file.
* Fault Naming Convention and Namespaces
For the sake of a unified naming convention, the following two types of
names are recommended for new aspect development:
* Activity specific notation (when we don't care about a particular
location of a fault's happening). In this case the name of the
fault is rather abstract: fi.hdfs.DiskError
* Location specific notation. Here, the fault's name is mnemonic as
in: fi.hdfs.datanode.BlockReceiver[optional location details]
* Development Tools
* The Eclipse AspectJ Development Toolkit may help you when
developing aspects
* IntelliJ IDEA provides AspectJ weaver and Spring-AOP plugins
* Putting It All Together
Faults (aspects) have to be injected (or woven) in place before they can
be used. Follow these instructions:
* To weave aspects in place use:
----
% ant injectfaults
----
* If you misidentified the join point of your aspect you will see a
warning (similar to the one shown here) when the 'injectfaults' target
is completed:
----
[iajc] warning at
src/test/aop/org/apache/hadoop/hdfs/server/datanode/ \
BlockReceiverAspects.aj:44::0
advice defined in org.apache.hadoop.hdfs.server.datanode.BlockReceiverAspects
has not been applied [Xlint:adviceDidNotMatch]
----
* This isn't an error, so the build will still report a successful result.
To prepare a dev.jar file with all your faults woven in place
(HDFS-475 pending) use:
----
% ant jar-fault-inject
----
* To create test jars use:
----
% ant jar-test-fault-inject
----
* To run HDFS tests with faults injected use:
----
% ant run-test-hdfs-fault-inject
----
** How to Use the Fault Injection Framework
Faults can be triggered as follows:
* During runtime:
----
% ant run-test-hdfs -Dfi.hdfs.datanode.BlockReceiver=0.12
----
To set a certain level, for example 25%, of all injected faults
use:
----
% ant run-test-hdfs-fault-inject -Dfi.*=0.25
----
* From a program:
----
package org.apache.hadoop.fs;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class DemoFiTest {
public static final String BLOCK_RECEIVER_FAULT="hdfs.datanode.BlockReceiver";
@Before
public void setUp() {
//Setting up the test's environment as required
}
@Test
public void testFI() {
// It triggers the fault, assuming that there's one called 'hdfs.datanode.BlockReceiver'
System.setProperty("fi." + BLOCK_RECEIVER_FAULT, "0.12");
//
// The main logic of your tests goes here
//
// Now set the level back to 0 (zero) to prevent this fault from happening again
System.setProperty("fi." + BLOCK_RECEIVER_FAULT, "0.0");
// or delete its trigger completely
System.getProperties().remove("fi." + BLOCK_RECEIVER_FAULT);
}
@After
public void tearDown() {
//Cleaning up the test environment
}
}
----
As you can see above, these two approaches do the same thing. They
set the probability level of <<<hdfs.datanode.BlockReceiver>>> to 12%.
The difference, however, is that the programmatic approach provides more
flexibility and allows you to turn a fault off when a test no longer needs it.
* Additional Information and Contacts
These two sources of information are particularly interesting and worth
reading:
* {{http://www.eclipse.org/aspectj/doc/next/devguide/}}
* AspectJ Cookbook (ISBN-13: 978-0-596-00654-9)
If you have additional comments or questions for the author, check
{{{https://issues.apache.org/jira/browse/HDFS-435}HDFS-435}}.

View File

@ -1,339 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Distributed File System-${project.version} - Federation
---
---
${maven.build.timestamp}
HDFS Federation
%{toc|section=1|fromDepth=0}
This guide provides an overview of the HDFS Federation feature and
how to configure and manage the federated cluster.
* {Background}
[./images/federation-background.gif] HDFS Layers
HDFS has two main layers:
* <<Namespace>>
* Consists of directories, files and blocks.
* It supports all the namespace related file system operations such as
create, delete, modify and list files and directories.
* <<Block Storage Service>>, which has two parts:
* Block Management (performed in the Namenode)
* Provides Datanode cluster membership by handling registrations and
periodic heartbeats.
* Processes block reports and maintains location of blocks.
* Supports block related operations such as create, delete, modify and
get block location.
* Manages replica placement, block replication for under
replicated blocks, and deletes blocks that are over replicated.
* Storage - is provided by Datanodes by storing blocks on the local file
system and allowing read/write access.
The prior HDFS architecture allows only a single namespace for the
entire cluster. In that configuration, a single Namenode manages the
namespace. HDFS Federation addresses this limitation by adding
support for multiple Namenodes/namespaces to HDFS.
* {Multiple Namenodes/Namespaces}
In order to scale the name service horizontally, federation uses multiple
independent Namenodes/namespaces. The Namenodes are federated; the
Namenodes are independent and do not require coordination with each other.
The Datanodes are used as common storage for blocks by all the Namenodes.
Each Datanode registers with all the Namenodes in the cluster. Datanodes
send periodic heartbeats and block reports. They also handle
commands from the Namenodes.
Users may use {{{./ViewFs.html}ViewFs}} to create personalized namespace views.
ViewFs is analogous to client side mount tables in some Unix/Linux systems.
[./images/federation.gif] HDFS Federation Architecture
<<Block Pool>>
A Block Pool is a set of blocks that belong to a single namespace.
Datanodes store blocks for all the block pools in the cluster. Each
Block Pool is managed independently. This allows a namespace to
generate Block IDs for new blocks without the need for coordination
with the other namespaces. A Namenode failure does not prevent the
Datanode from serving other Namenodes in the cluster.
A Namespace and its block pool together are called a Namespace Volume.
It is a self-contained unit of management. When a Namenode/namespace
is deleted, the corresponding block pool at the Datanodes is deleted.
Each namespace volume is upgraded as a unit during cluster upgrade.
<<ClusterID>>
A <<ClusterID>> identifier is used to identify all the nodes in the
cluster. When a Namenode is formatted, this identifier is either
provided or auto generated. This ID should be used when formatting
the other Namenodes in the cluster.
** Key Benefits
* Namespace Scalability - Federation adds namespace horizontal
scaling. Large deployments or deployments using a lot of small files
benefit from namespace scaling by allowing more Namenodes to be
added to the cluster.
* Performance - File system throughput is not limited by a single
Namenode. Adding more Namenodes to the cluster scales the file
system read/write throughput.
* Isolation - A single Namenode offers no isolation in a multi user
environment. For example, an experimental application can overload
the Namenode and slow down production critical applications. By using
multiple Namenodes, different categories of applications and users
can be isolated to different namespaces.
* {Federation Configuration}
Federation configuration is <<backward compatible>> and allows
existing single Namenode configurations to work without any
change. The new configuration is designed such that all the nodes in
the cluster have the same configuration without the need for
deploying different configurations based on the type of the node in
the cluster.
Federation adds a new <<<NameServiceID>>> abstraction. A Namenode
and its corresponding secondary/backup/checkpointer nodes all belong
to a NameServiceId. In order to support a single configuration file,
the Namenode and secondary/backup/checkpointer configuration
parameters are suffixed with the <<<NameServiceID>>>.
** Configuration:
<<Step 1>>: Add the <<<dfs.nameservices>>> parameter to your
configuration and configure it with a list of comma separated
NameServiceIDs. This will be used by the Datanodes to determine the
Namenodes in the cluster.
<<Step 2>>: For each Namenode and Secondary Namenode/BackupNode/Checkpointer
add the following configuration parameters suffixed with the corresponding
<<<NameServiceID>>> into the common configuration file:
*---------------------+--------------------------------------------+
|| Daemon || Configuration Parameter |
*---------------------+--------------------------------------------+
| Namenode | <<<dfs.namenode.rpc-address>>> |
| | <<<dfs.namenode.servicerpc-address>>> |
| | <<<dfs.namenode.http-address>>> |
| | <<<dfs.namenode.https-address>>> |
| | <<<dfs.namenode.keytab.file>>> |
| | <<<dfs.namenode.name.dir>>> |
| | <<<dfs.namenode.edits.dir>>> |
| | <<<dfs.namenode.checkpoint.dir>>> |
| | <<<dfs.namenode.checkpoint.edits.dir>>> |
*---------------------+--------------------------------------------+
| Secondary Namenode | <<<dfs.namenode.secondary.http-address>>> |
| | <<<dfs.secondary.namenode.keytab.file>>> |
*---------------------+--------------------------------------------+
| BackupNode | <<<dfs.namenode.backup.address>>> |
| | <<<dfs.secondary.namenode.keytab.file>>> |
*---------------------+--------------------------------------------+
Here is an example configuration with two Namenodes:
----
<configuration>
<property>
<name>dfs.nameservices</name>
<value>ns1,ns2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns1</name>
<value>nn-host1:rpc-port</value>
</property>
<property>
<name>dfs.namenode.http-address.ns1</name>
<value>nn-host1:http-port</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address.ns1</name>
<value>snn-host1:http-port</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ns2</name>
<value>nn-host2:rpc-port</value>
</property>
<property>
<name>dfs.namenode.http-address.ns2</name>
<value>nn-host2:http-port</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address.ns2</name>
<value>snn-host2:http-port</value>
</property>
.... Other common configuration ...
</configuration>
----
** Formatting Namenodes
<<Step 1>>: Format a Namenode using the following command:
----
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -format [-clusterId <cluster_id>]
----
Choose a unique cluster_id that will not conflict with other clusters in
your environment. If a cluster_id is not provided, then a unique one is
auto generated.
<<Step 2>>: Format additional Namenodes using the following command:
----
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -format -clusterId <cluster_id>
----
Note that the cluster_id in step 2 must be the same as the
cluster_id in step 1. If they are different, the additional Namenodes
will not be part of the federated cluster.
** Upgrading from an older release and configuring federation
Older releases support only a single Namenode.
Upgrade the cluster to a newer release in order to enable federation.
During the upgrade you can provide a ClusterID as follows:
----
[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon start namenode -upgrade -clusterId <cluster_ID>
----
If cluster_id is not provided, it is auto generated.
** Adding a new Namenode to an existing HDFS cluster
Perform the following steps:
* Add <<<dfs.nameservices>>> to the configuration.
* Update the configuration with the NameServiceID suffix. Configuration
key names changed post release 0.20. You must use the new configuration
parameter names in order to use federation.
* Add the new Namenode related config to the configuration file.
* Propagate the configuration file to all the nodes in the cluster.
* Start the new Namenode and Secondary/Backup.
* Refresh the Datanodes to pick up the newly added Namenode by running
the following command against all the Datanodes in the cluster:
----
[hdfs]$ $HADOOP_PREFIX/bin/hdfs dfsadmin -refreshNamenodes <datanode_host_name>:<datanode_rpc_port>
----
* {Managing the cluster}
** Starting and stopping cluster
To start the cluster run the following command:
----
[hdfs]$ $HADOOP_PREFIX/sbin/start-dfs.sh
----
To stop the cluster run the following command:
----
[hdfs]$ $HADOOP_PREFIX/sbin/stop-dfs.sh
----
These commands can be run from any node where the HDFS configuration is
available. The command uses the configuration to determine the Namenodes
in the cluster and then starts the Namenode process on those nodes. The
Datanodes are started on the nodes specified in the <<<slaves>>> file. The
script can be used as a reference for building your own scripts to
start and stop the cluster.
** Balancer
The Balancer has been changed to work with multiple
Namenodes. The Balancer can be run using the command:
----
[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon start balancer [-policy <policy>]
----
The policy parameter can be any of the following:
* <<<datanode>>> - this is the <default> policy. This balances the storage at
the Datanode level. This is similar to balancing policy from prior releases.
* <<<blockpool>>> - this balances the storage at the block pool
level which also balances at the Datanode level.
Note that Balancer only balances the data and does not balance the namespace.
For the complete command usage, see {{{../hadoop-common/CommandsManual.html#balancer}balancer}}.
** Decommissioning
Decommissioning is similar to prior releases. The nodes that need to be
decommissioned are added to the exclude file at all of the Namenodes. Each
Namenode decommissions its Block Pool. When all the Namenodes finish
decommissioning a Datanode, the Datanode is considered decommissioned.
<<Step 1>>: To distribute an exclude file to all the Namenodes, use the
following command:
----
[hdfs]$ $HADOOP_PREFIX/sbin/distribute-exclude.sh <exclude_file>
----
<<Step 2>>: Refresh all the Namenodes to pick up the new exclude file:
----
[hdfs]$ $HADOOP_PREFIX/sbin/refresh-namenodes.sh
----
The above command uses HDFS configuration to determine the
configured Namenodes in the cluster and refreshes them to pick up
the new exclude file.
** Cluster Web Console
Similar to the Namenode status web page, when using federation a
Cluster Web Console is available to monitor the federated cluster at
<<<http://<any_nn_host:port>/dfsclusterhealth.jsp>>>.
Any Namenode in the cluster can be used to access this web page.
The Cluster Web Console provides the following information:
* A cluster summary that shows the number of files, number of blocks,
total configured storage capacity, and the available and used storage
for the entire cluster.
* A list of Namenodes and a summary that includes the number of files,
blocks, missing blocks, and live and dead data nodes for each
Namenode. It also provides a link to access each Namenode's web UI.
* The decommissioning status of Datanodes.

View File

@ -1,797 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
HDFS Commands Guide
---
---
${maven.build.timestamp}
HDFS Commands Guide
%{toc|section=1|fromDepth=2|toDepth=3}
* Overview
All HDFS commands are invoked by the <<<bin/hdfs>>> script. Running the
hdfs script without any arguments prints the description for all
commands.
Usage: <<<hdfs [SHELL_OPTIONS] COMMAND [GENERIC_OPTIONS] [COMMAND_OPTIONS]>>>
Hadoop has an option parsing framework that parses generic options as
well as running classes.
*-------------------------+---------------------------------------------------+
|| COMMAND_OPTIONS || Description |
*-------------------------+---------------------------------------------------+
| SHELL_OPTIONS | The common set of shell options. These are documented on the {{{../../hadoop-project-dist/hadoop-common/CommandsManual.html#Shell Options}Commands Manual}} page.
*-------------------------+---------------------------------------------------+
| GENERIC_OPTIONS | The common set of options supported by multiple commands. See the Hadoop {{{../../hadoop-project-dist/hadoop-common/CommandsManual.html#Generic Options}Commands Manual}} for more information.
*-------------------------+---------------------------------------------------+
| COMMAND COMMAND_OPTIONS | Various commands with their options are described
| | in the following sections. The commands have been
| | grouped into {{User Commands}} and
| | {{Administration Commands}}.
*-------------------------+---------------------------------------------------+
* {User Commands}
Commands useful for users of a hadoop cluster.
** <<<classpath>>>
Usage: <<<hdfs classpath>>>
Prints the class path needed to get the Hadoop jar and the required libraries
** <<<dfs>>>
Usage: <<<hdfs dfs [COMMAND [COMMAND_OPTIONS]]>>>
Run a filesystem command on the file system supported in Hadoop.
The various COMMAND_OPTIONS can be found at
{{{../hadoop-common/FileSystemShell.html}File System Shell Guide}}.
** <<<fetchdt>>>
Usage: <<<hdfs fetchdt [--webservice <namenode_http_addr>] <path> >>>
*------------------------------+---------------------------------------------+
|| COMMAND_OPTION || Description
*------------------------------+---------------------------------------------+
| --webservice <namenode_http_addr> | Use HTTP protocol instead of RPC
*------------------------------+---------------------------------------------+
| <fileName> | File name to store the token into.
*------------------------------+---------------------------------------------+
Gets Delegation Token from a NameNode.
See {{{./HdfsUserGuide.html#fetchdt}fetchdt}} for more info.
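For example, assuming an illustrative NameNode web address of
<<<nn.example.com:50070>>>, a delegation token could be fetched into a local file with:
---
hdfs fetchdt --webservice http://nn.example.com:50070 /tmp/my.delegation.token
---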
** <<<fsck>>>
Usage:
---
hdfs fsck <path>
[-list-corruptfileblocks |
[-move | -delete | -openforwrite]
[-files [-blocks [-locations | -racks]]]
[-includeSnapshots] [-showprogress]
---
*------------------------+---------------------------------------------+
|| COMMAND_OPTION || Description
*------------------------+---------------------------------------------+
| <path> | Start checking from this path.
*------------------------+---------------------------------------------+
| -delete | Delete corrupted files.
*------------------------+---------------------------------------------+
| -files | Print out files being checked.
*------------------------+---------------------------------------------+
| -files -blocks | Print out the block report
*------------------------+---------------------------------------------+
| -files -blocks -locations | Print out locations for every block.
*------------------------+---------------------------------------------+
| -files -blocks -racks | Print out network topology for data-node locations.
*------------------------+---------------------------------------------+
| | Include snapshot data if the given path
| -includeSnapshots | indicates a snapshottable directory or
| | there are snapshottable directories under it.
*------------------------+---------------------------------------------+
| -list-corruptfileblocks| Print out list of missing blocks and
| | files they belong to.
*------------------------+---------------------------------------------+
| -move | Move corrupted files to /lost+found.
*------------------------+---------------------------------------------+
| -openforwrite | Print out files opened for write.
*------------------------+---------------------------------------------+
| -showprogress | Print out dots for progress in output. Default is OFF
| | (no progress).
*------------------------+---------------------------------------------+
Runs the HDFS filesystem checking utility.
See {{{./HdfsUserGuide.html#fsck}fsck}} for more info.
** <<<getconf>>>
Usage:
---
hdfs getconf -namenodes
hdfs getconf -secondaryNameNodes
hdfs getconf -backupNodes
hdfs getconf -includeFile
hdfs getconf -excludeFile
hdfs getconf -nnRpcAddresses
hdfs getconf -confKey [key]
---
*------------------------+---------------------------------------------+
|| COMMAND_OPTION || Description
*------------------------+---------------------------------------------+
| -namenodes | gets list of namenodes in the cluster.
*------------------------+---------------------------------------------+
| -secondaryNameNodes | gets list of secondary namenodes in the cluster.
*------------------------+---------------------------------------------+
| -backupNodes | gets list of backup nodes in the cluster.
*------------------------+---------------------------------------------+
| -includeFile | gets the include file path that defines the datanodes that can join the cluster.
*------------------------+---------------------------------------------+
| -excludeFile | gets the exclude file path that defines the datanodes that need to be decommissioned.
*------------------------+---------------------------------------------+
| -nnRpcAddresses | gets the namenode rpc addresses
*------------------------+---------------------------------------------+
| -confKey [key] | gets a specific key from the configuration
*------------------------+---------------------------------------------+
Gets configuration information from the configuration directory, post-processing.
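For example, to list the NameNodes in the cluster and then look up a single key (the key <<<dfs.replication>>> is just an example):
---
hdfs getconf -namenodes
hdfs getconf -confKey dfs.replication
---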
** <<<groups>>>
Usage: <<<hdfs groups [username ...]>>>
Returns the group information given one or more usernames.
** <<<lsSnapshottableDir>>>
Usage: <<<hdfs lsSnapshottableDir [-help]>>>
*------------------------+---------------------------------------------+
|| COMMAND_OPTION || Description
*------------------------+---------------------------------------------+
| -help | print help
*------------------------+---------------------------------------------+
Get the list of snapshottable directories. When this is run as a super user,
it returns all snapshottable directories. Otherwise it returns those directories
that are owned by the current user.
** <<<jmxget>>>
Usage: <<<hdfs jmxget [-localVM ConnectorURL | -port port | -server mbeanserver | -service service]>>>
*------------------------+---------------------------------------------+
|| COMMAND_OPTION || Description
*------------------------+---------------------------------------------+
| -help | print help
*------------------------+---------------------------------------------+
| -localVM ConnectorURL | connect to the VM on the same machine
*------------------------+---------------------------------------------+
| -port <mbean server port> | specify mbean server port, if missing
| | it will try to connect to MBean Server in
| | the same VM
*------------------------+---------------------------------------------+
| -service | specify jmx service, either DataNode or NameNode; NameNode is the default
*------------------------+---------------------------------------------+
Dump JMX information from a service.
** <<<oev>>>
Usage: <<<hdfs oev [OPTIONS] -i INPUT_FILE -o OUTPUT_FILE>>>
*** Required command line arguments:
*------------------------+---------------------------------------------+
|| COMMAND_OPTION || Description
*------------------------+---------------------------------------------+
|-i,--inputFile <arg> | edits file to process, xml (case
| insensitive) extension means XML format,
| any other filename means binary format
*------------------------+---------------------------------------------+
| -o,--outputFile <arg> | Name of output file. If the specified
| file exists, it will be overwritten,
| format of the file is determined
| by -p option
*------------------------+---------------------------------------------+
*** Optional command line arguments:
*------------------------+---------------------------------------------+
|| COMMAND_OPTION || Description
*------------------------+---------------------------------------------+
| -f,--fix-txids | Renumber the transaction IDs in the input,
| so that there are no gaps or invalid transaction IDs.
*------------------------+---------------------------------------------+
| -h,--help | Display usage information and exit
*------------------------+---------------------------------------------+
| -r,--recover | When reading binary edit logs, use recovery
| mode. This will give you the chance to skip
| corrupt parts of the edit log.
*------------------------+---------------------------------------------+
| -p,--processor <arg> | Select which type of processor to apply
| against the edits file, currently supported
| processors are: binary (native binary format
| that Hadoop uses), xml (default, XML
| format), stats (prints statistics about
| edits file)
*------------------------+---------------------------------------------+
| -v,--verbose | More verbose output, prints the input and
| output filenames, for processors that write
| to a file, also output to screen. On large
| image files this will dramatically increase
| processing time (default is false).
*------------------------+---------------------------------------------+
Hadoop offline edits viewer.
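A typical invocation, sketched below with an illustrative edit log file name, converts a binary edit log to XML using the default xml processor:
---
hdfs oev -i edits_0000000000000000001-0000000000000000100 -o edits.xml
---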
** <<<oiv>>>
Usage: <<<hdfs oiv [OPTIONS] -i INPUT_FILE>>>
*** Required command line arguments:
*------------------------+---------------------------------------------+
|| COMMAND_OPTION || Description
*------------------------+---------------------------------------------+
|-i,--inputFile <arg> | fsimage file to process
*------------------------+---------------------------------------------+
*** Optional command line arguments:
*------------------------+---------------------------------------------+
|| COMMAND_OPTION || Description
*------------------------+---------------------------------------------+
| -h,--help | Display usage information and exit
*------------------------+---------------------------------------------+
| -o,--outputFile <arg> | Name of output file. If the specified
| file exists, it will be overwritten,
| format of the file is determined
| by -p option
*------------------------+---------------------------------------------+
| -p,--processor <arg> | Select which type of processor to apply
| against image file, currently supported
| processors are: binary (native binary format
| that Hadoop uses), xml (default, XML
| format), stats (prints statistics about
| edits file)
*------------------------+---------------------------------------------+
Hadoop Offline Image Viewer for newer image files.
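A sketch of a typical invocation is shown below; the fsimage file name is illustrative, and the processor name accepted by -p may vary between releases, so check the -h output if in doubt:
---
hdfs oiv -p XML -i fsimage_0000000000000000100 -o fsimage.xml
---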
** <<<oiv_legacy>>>
Usage: <<<hdfs oiv_legacy [OPTIONS] -i INPUT_FILE -o OUTPUT_FILE>>>
*------------------------+---------------------------------------------+
|| COMMAND_OPTION || Description
*------------------------+---------------------------------------------+
| -h,--help | Display usage information and exit
*------------------------+---------------------------------------------+
|-i,--inputFile <arg> | fsimage file to process
*------------------------+---------------------------------------------+
| -o,--outputFile <arg> | Name of output file. If the specified
| file exists, it will be overwritten,
| format of the file is determined
| by -p option
*------------------------+---------------------------------------------+
Hadoop offline image viewer for older versions of Hadoop.
** <<<snapshotDiff>>>
Usage: <<<hdfs snapshotDiff <path> <fromSnapshot> <toSnapshot> >>>
Determine the difference between HDFS snapshots. See the
{{{./HdfsSnapshots.html#Get_Snapshots_Difference_Report}HDFS Snapshot Documentation}} for more information.
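For example, to compare two snapshots of a snapshottable directory (the directory and snapshot names below are illustrative):
---
hdfs snapshotDiff /user/data s1 s2
---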
** <<<version>>>
Usage: <<<hdfs version>>>
Prints the version.
* Administration Commands
Commands useful for administrators of a hadoop cluster.
** <<<balancer>>>
Usage: <<<hdfs balancer [-threshold <threshold>] [-policy <policy>] [-idleiterations <idleiterations>]>>>
*------------------------+----------------------------------------------------+
|| COMMAND_OPTION | Description
*------------------------+----------------------------------------------------+
| -policy <policy> | <<<datanode>>> (default): Cluster is balanced if
| | each datanode is balanced. \
| | <<<blockpool>>>: Cluster is balanced if each block
| | pool in each datanode is balanced.
*------------------------+----------------------------------------------------+
| -threshold <threshold> | Percentage of disk capacity. This overwrites the
| | default threshold.
*------------------------+----------------------------------------------------+
| -idleiterations <iterations> | Maximum number of idle iterations before exit.
| | This overwrites the default idleiterations(5).
*------------------------+----------------------------------------------------+
Runs a cluster balancing utility. An administrator can simply press Ctrl-C
to stop the rebalancing process. See
{{{./HdfsUserGuide.html#Balancer}Balancer}} for more details.
Note that the <<<blockpool>>> policy is more strict than the <<<datanode>>>
policy.
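For example, to run the balancer with a tighter threshold than the default (the value below is only illustrative):
---
hdfs balancer -threshold 5
---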
** <<<cacheadmin>>>
Usage: <<<hdfs cacheadmin -addDirective -path <path> -pool <pool-name> [-force] [-replication <replication>] [-ttl <time-to-live>]>>>
See the {{{./CentralizedCacheManagement.html#cacheadmin_command-line_interface}HDFS Cache Administration Documentation}} for more information.
** <<<crypto>>>
Usage:
---
hdfs crypto -createZone -keyName <keyName> -path <path>
hdfs crypto -help <command-name>
hdfs crypto -listZones
---
See the {{{./TransparentEncryption.html#crypto_command-line_interface}HDFS Transparent Encryption Documentation}} for more information.
** <<<datanode>>>
Usage: <<<hdfs datanode [-regular | -rollback | -rollingupgrade rollback]>>>
*-----------------+-----------------------------------------------------------+
|| COMMAND_OPTION || Description
*-----------------+-----------------------------------------------------------+
| -regular | Normal datanode startup (default).
*-----------------+-----------------------------------------------------------+
| -rollback | Rollback the datanode to the previous version. This should
| | be used after stopping the datanode and distributing the
| | old hadoop version.
*-----------------+-----------------------------------------------------------+
| -rollingupgrade rollback | Rollback a rolling upgrade operation.
*-----------------+-----------------------------------------------------------+
Runs an HDFS datanode.
** <<<dfsadmin>>>
Usage:
------------------------------------------
hdfs dfsadmin [GENERIC_OPTIONS]
[-report [-live] [-dead] [-decommissioning]]
[-safemode enter | leave | get | wait]
[-saveNamespace]
[-rollEdits]
[-restoreFailedStorage true|false|check]
[-refreshNodes]
[-setQuota <quota> <dirname>...<dirname>]
[-clrQuota <dirname>...<dirname>]
[-setSpaceQuota <quota> <dirname>...<dirname>]
[-clrSpaceQuota <dirname>...<dirname>]
[-setStoragePolicy <path> <policyName>]
[-getStoragePolicy <path>]
[-finalizeUpgrade]
[-rollingUpgrade [<query>|<prepare>|<finalize>]]
[-metasave filename]
[-refreshServiceAcl]
[-refreshUserToGroupsMappings]
[-refreshSuperUserGroupsConfiguration]
[-refreshCallQueue]
[-refresh <host:ipc_port> <key> [arg1..argn]]
[-reconfig <datanode|...> <host:ipc_port> <start|status>]
[-printTopology]
[-refreshNamenodes datanodehost:port]
[-deleteBlockPool datanode-host:port blockpoolId [force]]
[-setBalancerBandwidth <bandwidth in bytes per second>]
[-allowSnapshot <snapshotDir>]
[-disallowSnapshot <snapshotDir>]
[-fetchImage <local directory>]
[-shutdownDatanode <datanode_host:ipc_port> [upgrade]]
[-getDatanodeInfo <datanode_host:ipc_port>]
[-triggerBlockReport [-incremental] <datanode_host:ipc_port>]
[-help [cmd]]
------------------------------------------
*-----------------+-----------------------------------------------------------+
|| COMMAND_OPTION || Description
*-----------------+-----------------------------------------------------------+
| -report [-live] [-dead] [-decommissioning] | Reports basic filesystem
| information and statistics. Optional flags may be used to
| filter the list of displayed DataNodes.
*-----------------+-----------------------------------------------------------+
| -safemode enter\|leave\|get\|wait | Safe mode maintenance command. Safe
| mode is a Namenode state in which it \
| 1. does not accept changes to the name space (read-only) \
| 2. does not replicate or delete blocks. \
| Safe mode is entered automatically at Namenode startup, and
| leaves safe mode automatically when the configured minimum
| percentage of blocks satisfies the minimum replication
| condition. Safe mode can also be entered manually, but then
| it can only be turned off manually as well.
*-----------------+-----------------------------------------------------------+
| -saveNamespace | Save current namespace into storage directories and reset
| edits log. Requires safe mode.
*-----------------+-----------------------------------------------------------+
| -rollEdits | Rolls the edit log on the active NameNode.
*-----------------+-----------------------------------------------------------+
| -restoreFailedStorage true\|false\|check | This option will turn on/off
| automatic attempt to restore failed storage replicas.
| If a failed storage becomes available again the system will
| attempt to restore edits and/or fsimage during checkpoint.
| 'check' option will return current setting.
*-----------------+-----------------------------------------------------------+
| -refreshNodes | Re-read the hosts and exclude files to update the set of
| Datanodes that are allowed to connect to the Namenode and
| those that should be decommissioned or recommissioned.
*-----------------+-----------------------------------------------------------+
| -setQuota \<quota\> \<dirname\>...\<dirname\> | See
| {{{../hadoop-hdfs/HdfsQuotaAdminGuide.html#Administrative_Commands}HDFS Quotas Guide}}
| for details.
*-----------------+-----------------------------------------------------------+
| -clrQuota \<dirname\>...\<dirname\> | See
| {{{../hadoop-hdfs/HdfsQuotaAdminGuide.html#Administrative_Commands}HDFS Quotas Guide}}
| for details.
*-----------------+-----------------------------------------------------------+
| -setSpaceQuota \<quota\> \<dirname\>...\<dirname\> | See
| {{{../hadoop-hdfs/HdfsQuotaAdminGuide.html#Administrative_Commands}HDFS Quotas Guide}}
| for details.
*-----------------+-----------------------------------------------------------+
| -clrSpaceQuota \<dirname\>...\<dirname\> | See
| {{{../hadoop-hdfs/HdfsQuotaAdminGuide.html#Administrative_Commands}HDFS Quotas Guide}}
| for details.
*-----------------+-----------------------------------------------------------+
| -setStoragePolicy \<path\> \<policyName\> | Set a storage policy to a file or a directory.
*-----------------+-----------------------------------------------------------+
| -getStoragePolicy \<path\> | Get the storage policy of a file or a directory.
*-----------------+-----------------------------------------------------------+
| -finalizeUpgrade| Finalize upgrade of HDFS. Datanodes delete their previous
| version working directories, followed by Namenode doing the
| same. This completes the upgrade process.
*-----------------+-----------------------------------------------------------+
| -rollingUpgrade [\<query\>\|\<prepare\>\|\<finalize\>] | See
| {{{../hadoop-hdfs/HdfsRollingUpgrade.html#dfsadmin_-rollingUpgrade}Rolling Upgrade document}}
| for details.
*-----------------+-----------------------------------------------------------+
| -metasave filename | Save Namenode's primary data structures to <filename> in
| the directory specified by hadoop.log.dir property.
| <filename> is overwritten if it exists.
| <filename> will contain one line for each of the following\
| 1. Datanodes heart beating with Namenode\
| 2. Blocks waiting to be replicated\
| 3. Blocks currently being replicated\
| 4. Blocks waiting to be deleted
*-----------------+-----------------------------------------------------------+
| -refreshServiceAcl | Reload the service-level authorization policy file.
*-----------------+-----------------------------------------------------------+
| -refreshUserToGroupsMappings | Refresh user-to-groups mappings.
*-----------------+-----------------------------------------------------------+
| -refreshSuperUserGroupsConfiguration | Refresh superuser proxy groups mappings.
*-----------------+-----------------------------------------------------------+
| -refreshCallQueue | Reload the call queue from config.
*-----------------+-----------------------------------------------------------+
| -refresh \<host:ipc_port\> \<key\> [arg1..argn] | Triggers a runtime-refresh
| of the resource specified by \<key\> on \<host:ipc_port\>.
| All other args after are sent to the host.
*-----------------+-----------------------------------------------------------+
| -reconfig \<datanode\|...\> \<host:ipc_port\> \<start\|status\> | Start
| reconfiguration or get the status of an ongoing
| reconfiguration. The second parameter specifies the node
| type. Currently, only reloading DataNode's configuration is
| supported.
*-----------------+-----------------------------------------------------------+
| -printTopology | Print a tree of the racks and their nodes as reported by
| the Namenode
*-----------------+-----------------------------------------------------------+
| -refreshNamenodes datanodehost:port | For the given datanode, reloads the
| configuration files, stops serving the removed block-pools
| and starts serving new block-pools.
*-----------------+-----------------------------------------------------------+
| -deleteBlockPool datanode-host:port blockpoolId [force] | If force is passed,
| block pool directory for the given blockpool id on the
| given datanode is deleted along with its contents,
| otherwise the directory is deleted only if it is empty.
| The command will fail if datanode is still serving the
| block pool. Refer to refreshNamenodes to shutdown a block
| pool service on a datanode.
*-----------------+-----------------------------------------------------------+
| -setBalancerBandwidth \<bandwidth in bytes per second\> | Changes the network
| bandwidth used by each datanode during HDFS block
| balancing. \<bandwidth\> is the maximum number of bytes per
| second that will be used by each datanode. This value
| overrides the dfs.balance.bandwidthPerSec parameter.\
| NOTE: The new value is not persistent on the DataNode.
*-----------------+-----------------------------------------------------------+
| -allowSnapshot \<snapshotDir\> | Allow snapshots of a directory to be
| created. If the operation completes successfully, the
| directory becomes snapshottable. See the {{{./HdfsSnapshots.html}HDFS Snapshot Documentation}} for more information.
*-----------------+-----------------------------------------------------------+
| -disallowSnapshot \<snapshotDir\> | Disallow snapshots of a directory to
| be created. All snapshots of the directory must be deleted
| before disallowing snapshots. See the {{{./HdfsSnapshots.html}HDFS Snapshot Documentation}} for more information.
*-----------------+-----------------------------------------------------------+
| -fetchImage \<local directory\> | Downloads the most recent fsimage from the
| NameNode and saves it in the specified local directory.
*-----------------+-----------------------------------------------------------+
| -shutdownDatanode \<datanode_host:ipc_port\> [upgrade] | Submit a shutdown
| request for the given datanode. See
| {{{./HdfsRollingUpgrade.html#dfsadmin_-shutdownDatanode}Rolling Upgrade document}}
| for details.
*-----------------+-----------------------------------------------------------+
| -getDatanodeInfo \<datanode_host:ipc_port\> | Get the information about the
| given datanode. See
| {{{./HdfsRollingUpgrade.html#dfsadmin_-getDatanodeInfo}Rolling Upgrade document}}
| for details.
*-----------------+-----------------------------------------------------------+
| -triggerBlockReport [-incremental] \<datanode_host:ipc_port\> | Trigger a
| block report for the given datanode. If 'incremental' is
| specified, it will be an incremental block report;
| otherwise, it will be a full block report.
*-----------------+-----------------------------------------------------------+
| -help [cmd] | Displays help for the given command or all commands if none
| is specified.
*-----------------+-----------------------------------------------------------+
Runs an HDFS dfsadmin client.
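As a quick sketch, a few common invocations built only from the options listed above:
---
hdfs dfsadmin -report -live
hdfs dfsadmin -safemode get
hdfs dfsadmin -printTopology
---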
** <<<haadmin>>>
Usage:
---
hdfs haadmin -checkHealth <serviceId>
hdfs haadmin -failover [--forcefence] [--forceactive] <serviceId> <serviceId>
hdfs haadmin -getServiceState <serviceId>
hdfs haadmin -help <command>
hdfs haadmin -transitionToActive <serviceId> [--forceactive]
hdfs haadmin -transitionToStandby <serviceId>
---
*--------------------+--------------------------------------------------------+
|| COMMAND_OPTION || Description
*--------------------+--------------------------------------------------------+
| -checkHealth | check the health of the given NameNode
*--------------------+--------------------------------------------------------+
| -failover | initiate a failover between two NameNodes
*--------------------+--------------------------------------------------------+
| -getServiceState | determine whether the given NameNode is Active or Standby
*--------------------+--------------------------------------------------------+
| -transitionToActive | transition the state of the given NameNode to Active (Warning: No fencing is done)
*--------------------+--------------------------------------------------------+
| -transitionToStandby | transition the state of the given NameNode to Standby (Warning: No fencing is done)
*--------------------+--------------------------------------------------------+
See {{{./HDFSHighAvailabilityWithNFS.html#Administrative_commands}HDFS HA with NFS}} or
{{{./HDFSHighAvailabilityWithQJM.html#Administrative_commands}HDFS HA with QJM}} for more
information on this command.
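For example, assuming the illustrative NameNode IDs nn1 and nn2, an administrator could query the state of one NameNode and then initiate a graceful failover:
---
hdfs haadmin -getServiceState nn1
hdfs haadmin -failover nn1 nn2
---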
** <<<journalnode>>>
Usage: <<<hdfs journalnode>>>
This command starts a journalnode for use with {{{./HDFSHighAvailabilityWithQJM.html#Administrative_commands}HDFS HA with QJM}}.
** <<<mover>>>
Usage: <<<hdfs mover [-p <files/dirs> | -f <local file name>]>>>
*--------------------+--------------------------------------------------------+
|| COMMAND_OPTION || Description
*--------------------+--------------------------------------------------------+
| -f \<local file\> | Specify a local file containing a list of HDFS files/dirs to migrate.
*--------------------+--------------------------------------------------------+
| -p \<files/dirs\> | Specify a space separated list of HDFS files/dirs to migrate.
*--------------------+--------------------------------------------------------+
Runs the data migration utility.
See {{{./ArchivalStorage.html#Mover_-_A_New_Data_Migration_Tool}Mover}} for more details.
Note that, when both -p and -f options are omitted, the default path is the root directory.
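For example, to migrate a directory after its storage policy has been changed (the path below is illustrative):
---
hdfs mover -p /archive/dataset1
---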
** <<<namenode>>>
Usage:
------------------------------------------
hdfs namenode [-backup] |
[-checkpoint] |
[-format [-clusterid cid ] [-force] [-nonInteractive] ] |
[-upgrade [-clusterid cid] [-renameReserved<k-v pairs>] ] |
[-upgradeOnly [-clusterid cid] [-renameReserved<k-v pairs>] ] |
[-rollback] |
[-rollingUpgrade <downgrade|rollback> ] |
[-finalize] |
[-importCheckpoint] |
[-initializeSharedEdits] |
[-bootstrapStandby] |
[-recover [-force] ] |
[-metadataVersion ]
------------------------------------------
*--------------------+--------------------------------------------------------+
|| COMMAND_OPTION || Description
*--------------------+--------------------------------------------------------+
| -backup | Start backup node.
*--------------------+--------------------------------------------------------+
| -checkpoint | Start checkpoint node.
*--------------------+--------------------------------------------------------+
| -format [-clusterid cid] [-force] [-nonInteractive] | Formats the specified
| NameNode. It starts the NameNode, formats it and then
| shuts it down. The -force option formats if the name
| directory exists. -nonInteractive option aborts if the
| name directory exists, unless -force option is specified.
*--------------------+--------------------------------------------------------+
| -upgrade [-clusterid cid] [-renameReserved\<k-v pairs\>] | Namenode should be
| started with upgrade option after
| the distribution of new Hadoop version.
*--------------------+--------------------------------------------------------+
| -upgradeOnly [-clusterid cid] [-renameReserved\<k-v pairs\>] | Upgrade the
| specified NameNode and then shut it down.
*--------------------+--------------------------------------------------------+
| -rollback | Rollback the NameNode to the previous version. This
| should be used after stopping the cluster and
| distributing the old Hadoop version.
*--------------------+--------------------------------------------------------+
| -rollingUpgrade \<downgrade\|rollback\|started\> | See
| {{{./HdfsRollingUpgrade.html#NameNode_Startup_Options}Rolling Upgrade document}}
| for details.
*--------------------+--------------------------------------------------------+
| -finalize | Finalize will remove the previous state of the file
| system. The recent upgrade will become permanent and the
| rollback option will no longer be available. After
| finalization it shuts the NameNode down.
*--------------------+--------------------------------------------------------+
| -importCheckpoint | Loads an image from a checkpoint directory and saves it
| into the current one. The checkpoint directory is read from
| the property fs.checkpoint.dir.
*--------------------+--------------------------------------------------------+
| -initializeSharedEdits | Format a new shared edits dir and copy in enough
| edit log segments so that the standby NameNode can start
| up.
*--------------------+--------------------------------------------------------+
| -bootstrapStandby | Allows the standby NameNode's storage directories to be
| bootstrapped by copying the latest namespace snapshot
| from the active NameNode. This is used when first
| configuring an HA cluster.
*--------------------+--------------------------------------------------------+
| -recover [-force] | Recover lost metadata on a corrupt filesystem. See
| {{{./HdfsUserGuide.html#Recovery_Mode}HDFS User Guide}}
| for details.
*--------------------+--------------------------------------------------------+
| -metadataVersion | Verify that configured directories exist, then print the
| metadata versions of the software and the image.
*--------------------+--------------------------------------------------------+
Runs the namenode. More info about the upgrade, rollback and finalize is at
{{{./HdfsUserGuide.html#Upgrade_and_Rollback}Upgrade Rollback}}.
** <<<nfs3>>>
Usage: <<<hdfs nfs3>>>
This command starts the NFS3 gateway for use with the {{{./HdfsNfsGateway.html#Start_and_stop_NFS_gateway_service}HDFS NFS3 Service}}.
** <<<portmap>>>
Usage: <<<hdfs portmap>>>
This command starts the RPC portmap for use with the {{{./HdfsNfsGateway.html#Start_and_stop_NFS_gateway_service}HDFS NFS3 Service}}.
** <<<secondarynamenode>>>
Usage: <<<hdfs secondarynamenode [-checkpoint [force]] | [-format] |
[-geteditsize]>>>
*----------------------+------------------------------------------------------+
|| COMMAND_OPTION || Description
*----------------------+------------------------------------------------------+
| -checkpoint [force] | Checkpoints the SecondaryNameNode if EditLog size
| >= fs.checkpoint.size. If <<<force>>> is used,
| checkpoint irrespective of EditLog size.
*----------------------+------------------------------------------------------+
| -format | Format the local storage during startup.
*----------------------+------------------------------------------------------+
| -geteditsize | Prints the number of uncheckpointed transactions on
| the NameNode.
*----------------------+------------------------------------------------------+
Runs the HDFS secondary namenode.
See {{{./HdfsUserGuide.html#Secondary_NameNode}Secondary Namenode}}
for more info.
** <<<storagepolicies>>>
Usage: <<<hdfs storagepolicies>>>
Lists out all storage policies. See the {{{./ArchivalStorage.html}HDFS Storage Policy Documentation}} for more information.
** <<<zkfc>>>
Usage: <<<hdfs zkfc [-formatZK [-force] [-nonInteractive]]>>>
*----------------------+------------------------------------------------------+
|| COMMAND_OPTION || Description
*----------------------+------------------------------------------------------+
| -formatZK | Format the Zookeeper instance
*----------------------+------------------------------------------------------+
| -h | Display help
*----------------------+------------------------------------------------------+
This command starts a ZooKeeper Failover Controller process for use with {{{./HDFSHighAvailabilityWithQJM.html#Administrative_commands}HDFS HA with QJM}}.
* Debug Commands
Useful commands to help administrators debug HDFS issues, like validating
block files and calling recoverLease.
** <<<verify>>>
Usage: <<<hdfs debug verify [-meta <metadata-file>] [-block <block-file>]>>>
*------------------------+----------------------------------------------------+
|| COMMAND_OPTION | Description
*------------------------+----------------------------------------------------+
| -block <block-file> | Optional parameter to specify the absolute path for
| | the block file on the local file system of the data
| | node.
*------------------------+----------------------------------------------------+
| -meta <metadata-file> | Absolute path for the metadata file on the local file
| | system of the data node.
*------------------------+----------------------------------------------------+
Verify HDFS metadata and block files. If a block file is specified, we
will verify that the checksums in the metadata file match the block
file.
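A minimal sketch is shown below; both paths are purely illustrative and stand in for the real block and metadata file locations on the datanode's local file system:
---
hdfs debug verify -meta /data/dfs/dn/current/finalized/blk_1073741825_1001.meta -block /data/dfs/dn/current/finalized/blk_1073741825
---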
** <<<recoverLease>>>
Usage: <<<hdfs debug recoverLease [-path <path>] [-retries <num-retries>]>>>
*-------------------------------+--------------------------------------------+
|| COMMAND_OPTION || Description
*-------------------------------+---------------------------------------------+
| [-path <path>] | HDFS path for which to recover the lease.
*-------------------------------+---------------------------------------------+
| [-retries <num-retries>] | Number of times the client will retry calling
| | recoverLease. The default number of retries
| | is 1.
*-------------------------------+---------------------------------------------+
Recover the lease on the specified path. The path must reside on an
HDFS filesystem. The default number of retries is 1.
View File
@ -1,859 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Distributed File System-${project.version} - High Availability
---
---
${maven.build.timestamp}
HDFS High Availability
%{toc|section=1|fromDepth=0}
* {Purpose}
This guide provides an overview of the HDFS High Availability (HA) feature and
how to configure and manage an HA HDFS cluster, using NFS for the shared
storage required by the NameNodes.
This document assumes that the reader has a general understanding of
general components and node types in an HDFS cluster. Please refer to the
HDFS Architecture guide for details.
* {Note: Using the Quorum Journal Manager or Conventional Shared Storage}
This guide discusses how to configure and use HDFS HA using a shared NFS
directory to share edit logs between the Active and Standby NameNodes. For
information on how to configure HDFS HA using the Quorum Journal Manager
instead of NFS, please see {{{./HDFSHighAvailabilityWithQJM.html}this
alternative guide.}}
* {Background}
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in
an HDFS cluster. Each cluster had a single NameNode, and if that machine or
process became unavailable, the cluster as a whole would be unavailable
until the NameNode was either restarted or brought up on a separate machine.
This impacted the total availability of the HDFS cluster in two major ways:
* In the case of an unplanned event such as a machine crash, the cluster would
be unavailable until an operator restarted the NameNode.
* Planned maintenance events such as software or hardware upgrades on the
NameNode machine would result in windows of cluster downtime.
The HDFS High Availability feature addresses the above problems by providing
the option of running two redundant NameNodes in the same cluster in an
Active/Passive configuration with a hot standby. This allows a fast failover to
a new NameNode in the case that a machine crashes, or a graceful
administrator-initiated failover for the purpose of planned maintenance.
* {Architecture}
In a typical HA cluster, two separate machines are configured as NameNodes.
At any point in time, exactly one of the NameNodes is in an <Active> state,
and the other is in a <Standby> state. The Active NameNode is responsible
for all client operations in the cluster, while the Standby is simply acting
as a slave, maintaining enough state to provide a fast failover if
necessary.
In order for the Standby node to keep its state synchronized with the Active
node, the current implementation requires that the two nodes both have access
to a directory on a shared storage device (eg an NFS mount from a NAS). This
restriction will likely be relaxed in future versions.
When any namespace modification is performed by the Active node, it durably
logs a record of the modification to an edit log file stored in the shared
directory. The Standby node is constantly watching this directory for edits,
and as it sees the edits, it applies them to its own namespace. In the event of
a failover, the Standby will ensure that it has read all of the edits from the
shared storage before promoting itself to the Active state. This ensures that
the namespace state is fully synchronized before a failover occurs.
In order to provide a fast failover, it is also necessary that the Standby node
have up-to-date information regarding the location of blocks in the cluster.
In order to achieve this, the DataNodes are configured with the location of
both NameNodes, and send block location information and heartbeats to both.
It is vital for the correct operation of an HA cluster that only one of the
NameNodes be Active at a time. Otherwise, the namespace state would quickly
diverge between the two, risking data loss or other incorrect results. In
order to ensure this property and prevent the so-called "split-brain scenario,"
the administrator must configure at least one <fencing method> for the shared
storage. During a failover, if it cannot be verified that the previous Active
node has relinquished its Active state, the fencing process is responsible for
cutting off the previous Active's access to the shared edits storage. This
prevents it from making any further edits to the namespace, allowing the new
Active to safely proceed with failover.
* {Hardware resources}
In order to deploy an HA cluster, you should prepare the following:
* <<NameNode machines>> - the machines on which you run the Active and
Standby NameNodes should have equivalent hardware to each other, and
equivalent hardware to what would be used in a non-HA cluster.
* <<Shared storage>> - you will need to have a shared directory which both
NameNode machines can have read/write access to. Typically this is a remote
filer which supports NFS and is mounted on each of the NameNode machines.
Currently only a single shared edits directory is supported. Thus, the
availability of the system is limited by the availability of this shared edits
directory, and therefore in order to remove all single points of failure there
needs to be redundancy for the shared edits directory. Specifically, multiple
network paths to the storage, and redundancy in the storage itself (disk,
network, and power). Because of this, it is recommended that the shared storage
server be a high-quality dedicated NAS appliance rather than a simple Linux
server.
Note that, in an HA cluster, the Standby NameNode also performs checkpoints of
the namespace state, and thus it is not necessary to run a Secondary NameNode,
CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an
error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster
to be HA-enabled to reuse the hardware which they had previously dedicated to
the Secondary NameNode.
* {Deployment}
** Configuration overview
Similar to Federation configuration, HA configuration is backward compatible
and allows existing single NameNode configurations to work without change.
The new configuration is designed such that all the nodes in the cluster may
have the same configuration without the need for deploying different
configuration files to different machines based on the type of the node.
Like HDFS Federation, HA clusters reuse the <<<nameservice ID>>> to identify a
single HDFS instance that may in fact consist of multiple HA NameNodes. In
addition, a new abstraction called <<<NameNode ID>>> is added with HA. Each
distinct NameNode in the cluster has a different NameNode ID to distinguish it.
To support a single configuration file for all of the NameNodes, the relevant
configuration parameters are suffixed with the <<nameservice ID>> as well as
the <<NameNode ID>>.
** Configuration details
To configure HA NameNodes, you must add several configuration options to your
<<hdfs-site.xml>> configuration file.
The order in which you set these configurations is unimportant, but the values
you choose for <<dfs.nameservices>> and
<<dfs.ha.namenodes.[nameservice ID]>> will determine the keys of those that
follow. Thus, you should decide on these values before setting the rest of the
configuration options.
* <<dfs.nameservices>> - the logical name for this new nameservice
Choose a logical name for this nameservice, for example "mycluster", and use
this logical name for the value of this config option. The name you choose is
arbitrary. It will be used both for configuration and as the authority
component of absolute HDFS paths in the cluster.
<<Note:>> If you are also using HDFS Federation, this configuration setting
should also include the list of other nameservices, HA or otherwise, as a
comma-separated list.
----
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
----
* <<dfs.ha.namenodes.[nameservice ID]>> - unique identifiers for each NameNode in the nameservice
Configure with a list of comma-separated NameNode IDs. This will be used by
DataNodes to determine all the NameNodes in the cluster. For example, if you
used "mycluster" as the nameservice ID previously, and you wanted to use "nn1"
and "nn2" as the individual IDs of the NameNodes, you would configure this as
such:
----
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
----
<<Note:>> Currently, only a maximum of two NameNodes may be configured per
nameservice.
* <<dfs.namenode.rpc-address.[nameservice ID].[name node ID]>> - the fully-qualified RPC address for each NameNode to listen on
For both of the previously-configured NameNode IDs, set the full address and
IPC port of the NameNode process. Note that this results in two separate
configuration options. For example:
----
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>machine1.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>machine2.example.com:8020</value>
</property>
----
<<Note:>> You may similarly configure the "<<servicerpc-address>>" setting if
you so desire.
* <<dfs.namenode.http-address.[nameservice ID].[name node ID]>> - the fully-qualified HTTP address for each NameNode to listen on
Similarly to <rpc-address> above, set the addresses for both NameNodes' HTTP
servers to listen on. For example:
----
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>machine1.example.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>machine2.example.com:50070</value>
</property>
----
<<Note:>> If you have Hadoop's security features enabled, you should also set
the <https-address> similarly for each NameNode.
* <<dfs.namenode.shared.edits.dir>> - the location of the shared storage directory
This is where one configures the path to the remote shared edits directory
which the Standby NameNode uses to stay up-to-date with all the file system
changes the Active NameNode makes. <<You should only configure one of these
directories.>> This directory should be mounted r/w on both NameNode machines.
The value of this setting should be the absolute path to this directory on the
NameNode machines. For example:
----
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>file:///mnt/filer1/dfs/ha-name-dir-shared</value>
</property>
----
* <<dfs.client.failover.proxy.provider.[nameservice ID]>> - the Java class that HDFS clients use to contact the Active NameNode
Configure the name of the Java class which will be used by the DFS Client to
determine which NameNode is the current Active, and therefore which NameNode is
currently serving client requests. The only implementation which currently
ships with Hadoop is the <<ConfiguredFailoverProxyProvider>>, so use this
unless you are using a custom one. For example:
----
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
----
* <<dfs.ha.fencing.methods>> - a list of scripts or Java classes which will be used to fence the Active NameNode during a failover
It is critical for correctness of the system that only one NameNode be in the
Active state at any given time. Thus, during a failover, we first ensure that
the Active NameNode is either in the Standby state, or the process has
terminated, before transitioning the other NameNode to the Active state. In
order to do this, you must configure at least one <<fencing method.>> These are
configured as a carriage-return-separated list, which will be attempted in order
until one indicates that fencing has succeeded. There are two methods which
ship with Hadoop: <shell> and <sshfence>. For information on implementing
your own custom fencing method, see the <org.apache.hadoop.ha.NodeFencer> class.
* <<sshfence>> - SSH to the Active NameNode and kill the process
The <sshfence> option SSHes to the target node and uses <fuser> to kill the
process listening on the service's TCP port. In order for this fencing option
to work, it must be able to SSH to the target node without providing a
passphrase. Thus, one must also configure the
<<dfs.ha.fencing.ssh.private-key-files>> option, which is a
comma-separated list of SSH private key files. For example:
---
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/exampleuser/.ssh/id_rsa</value>
</property>
---
Optionally, one may configure a non-standard username or port to perform the
SSH. One may also configure a timeout, in milliseconds, for the SSH, after
which this fencing method will be considered to have failed. It may be
configured like so:
---
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence([[username][:port]])</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>
---
* <<shell>> - run an arbitrary shell command to fence the Active NameNode
The <shell> fencing method runs an arbitrary shell command. It may be
configured like so:
---
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
</property>
---
The string between '(' and ')' is passed directly to a bash shell and may not
include any closing parentheses.
The shell command will be run with an environment set up to contain all of the
current Hadoop configuration variables, with the '_' character replacing any
'.' characters in the configuration keys. The configuration used has already had
any namenode-specific configurations promoted to their generic forms -- for example
<<dfs_namenode_rpc-address>> will contain the RPC address of the target node, even
though the configuration may specify that variable as
<<dfs.namenode.rpc-address.ns1.nn1>>.
Additionally, the following variables referring to the target node to be fenced
are also available:
*-----------------------:-----------------------------------+
| $target_host | hostname of the node to be fenced |
*-----------------------:-----------------------------------+
| $target_port | IPC port of the node to be fenced |
*-----------------------:-----------------------------------+
| $target_address | the above two, combined as host:port |
*-----------------------:-----------------------------------+
| $target_nameserviceid | the nameservice ID of the NN to be fenced |
*-----------------------:-----------------------------------+
| $target_namenodeid | the namenode ID of the NN to be fenced |
*-----------------------:-----------------------------------+
These environment variables may also be used as substitutions in the shell
command itself. For example:
---
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/path/to/my/script.sh --nameservice=$target_nameserviceid $target_host:$target_port)</value>
</property>
---
If the shell command returns an exit
code of 0, the fencing is determined to be successful. If it returns any other
exit code, the fencing was not successful and the next fencing method in the
list will be attempted.
<<Note:>> This fencing method does not implement any timeout. If timeouts are
necessary, they should be implemented in the shell script itself (eg by forking
a subshell to kill its parent in some number of seconds).
* <<fs.defaultFS>> - the default path prefix used by the Hadoop FS client when none is given
Optionally, you may now configure the default path for Hadoop clients to use
the new HA-enabled logical URI. If you used "mycluster" as the nameservice ID
earlier, this will be the value of the authority portion of all of your HDFS
paths. This may be configured like so, in your <<core-site.xml>> file:
---
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
---
** Deployment details
After all of the necessary configuration options have been set, one must
initially synchronize the two HA NameNodes' on-disk metadata.
* If you are setting up a fresh HDFS cluster, you should first run the format
command (<hdfs namenode -format>) on one of NameNodes.
* If you have already formatted the NameNode, or are converting a
non-HA-enabled cluster to be HA-enabled, you should now copy over the
contents of your NameNode metadata directories to the other, unformatted
NameNode by running the command "<hdfs namenode -bootstrapStandby>" on the
unformatted NameNode. Running this command will also ensure that the shared
edits directory (as configured by <<dfs.namenode.shared.edits.dir>>) contains
sufficient edits transactions to be able to start both NameNodes.
* If you are converting a non-HA NameNode to be HA, you should run the
command "<hdfs -initializeSharedEdits>", which will initialize the shared
edits directory with the edits data from the local NameNode edits directories.
At this point you may start both of your HA NameNodes as you normally would
start a NameNode.
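Put together, a minimal deployment sequence for a fresh cluster might look like the sketch below, run on the appropriate hosts. It assumes the <<<hdfs --daemon>>> start syntax used elsewhere in this document and is illustrative rather than prescriptive:
----
# On the first NameNode (e.g. nn1): format it, then start it.
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -format
[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon start namenode

# On the second NameNode (e.g. nn2): copy over the metadata, then start it.
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -bootstrapStandby
[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon start namenode
----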
You can visit each of the NameNodes' web pages separately by browsing to their
configured HTTP addresses. You should notice that next to the configured
address will be the HA state of the NameNode (either "standby" or "active").
Whenever an HA NameNode starts, it is initially in the Standby state.
** Administrative commands
Now that your HA NameNodes are configured and started, you will have access
to some additional commands to administer your HA HDFS cluster. Specifically,
you should familiarize yourself with all of the subcommands of the "<hdfs
haadmin>" command. Running this command without any additional arguments will
display the following usage information:
---
Usage: DFSHAAdmin [-ns <nameserviceId>]
[-transitionToActive <serviceId>]
[-transitionToStandby <serviceId>]
[-failover [--forcefence] [--forceactive] <serviceId> <serviceId>]
[-getServiceState <serviceId>]
[-checkHealth <serviceId>]
[-help <command>]
---
This guide describes high-level uses of each of these subcommands. For
specific usage information of each subcommand, you should run "<hdfs haadmin
-help <command>>".
* <<transitionToActive>> and <<transitionToStandby>> - transition the state of the given NameNode to Active or Standby
These subcommands cause a given NameNode to transition to the Active or Standby
state, respectively. <<These commands do not attempt to perform any fencing,
and thus should rarely be used.>> Instead, one should almost always prefer to
use the "<hdfs haadmin -failover>" subcommand.
* <<failover>> - initiate a failover between two NameNodes
This subcommand causes a failover from the first provided NameNode to the
second. If the first NameNode is in the Standby state, this command simply
transitions the second to the Active state without error. If the first NameNode
is in the Active state, an attempt will be made to gracefully transition it to
the Standby state. If this fails, the fencing methods (as configured by
<<dfs.ha.fencing.methods>>) will be attempted in order until one
succeeds. Only after this process will the second NameNode be transitioned to
the Active state. If no fencing method succeeds, the second NameNode will not
be transitioned to the Active state, and an error will be returned.
* <<getServiceState>> - determine whether the given NameNode is Active or Standby
Connect to the provided NameNode to determine its current state, printing
either "standby" or "active" to STDOUT appropriately. This subcommand might be
used by cron jobs or monitoring scripts which need to behave differently based
on whether the NameNode is currently Active or Standby (see the sketch after this list).
* <<checkHealth>> - check the health of the given NameNode
Connect to the provided NameNode to check its health. The NameNode is capable
of performing some diagnostics on itself, including checking if internal
services are running as expected. This command will return 0 if the NameNode is
healthy, non-zero otherwise. One might use this command for monitoring
purposes.
<<Note:>> This is not yet implemented, and at present will always return
success, unless the given NameNode is completely down.
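As a minimal sketch of the monitoring use mentioned for <<getServiceState>> above, the script below (which assumes the illustrative NameNode ID nn1) exits non-zero when that node is not Active; how the result is alerted on is left to the monitoring system:
----
#!/usr/bin/env bash
# Illustrative check: fail if nn1 does not currently report the "active" state.
state=$("$HADOOP_PREFIX"/bin/hdfs haadmin -getServiceState nn1)
if [ "$state" != "active" ]; then
  echo "NameNode nn1 is in state '$state'"
  exit 1
fi
----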
* {Automatic Failover}
** Introduction
The above sections describe how to configure manual failover. In that mode,
the system will not automatically trigger a failover from the active to the
standby NameNode, even if the active node has failed. This section describes
how to configure and deploy automatic failover.
** Components
Automatic failover adds two new components to an HDFS deployment: a ZooKeeper
quorum, and the ZKFailoverController process (abbreviated as ZKFC).
Apache ZooKeeper is a highly available service for maintaining small amounts
of coordination data, notifying clients of changes in that data, and
monitoring clients for failures. The implementation of automatic HDFS failover
relies on ZooKeeper for the following things:
* <<Failure detection>> - each of the NameNode machines in the cluster
maintains a persistent session in ZooKeeper. If the machine crashes, the
ZooKeeper session will expire, notifying the other NameNode that a failover
should be triggered.
* <<Active NameNode election>> - ZooKeeper provides a simple mechanism to
exclusively elect a node as active. If the current active NameNode crashes,
another node may take a special exclusive lock in ZooKeeper indicating that
it should become the next active.
The ZKFailoverController (ZKFC) is a new component which is a ZooKeeper client
which also monitors and manages the state of the NameNode. Each of the
machines which runs a NameNode also runs a ZKFC, and that ZKFC is responsible
for:
* <<Health monitoring>> - the ZKFC pings its local NameNode on a periodic
basis with a health-check command. So long as the NameNode responds in a
timely fashion with a healthy status, the ZKFC considers the node
healthy. If the node has crashed, frozen, or otherwise entered an unhealthy
state, the health monitor will mark it as unhealthy.
* <<ZooKeeper session management>> - when the local NameNode is healthy, the
ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it
also holds a special "lock" znode. This lock uses ZooKeeper's support for
"ephemeral" nodes; if the session expires, the lock node will be
automatically deleted.
* <<ZooKeeper-based election>> - if the local NameNode is healthy, and the
ZKFC sees that no other node currently holds the lock znode, it will itself
try to acquire the lock. If it succeeds, then it has "won the election", and
is responsible for running a failover to make its local NameNode active. The
failover process is similar to the manual failover described above: first,
the previous active is fenced if necessary, and then the local NameNode
transitions to active state.
For more details on the design of automatic failover, refer to the design
document attached to HDFS-2185 on the Apache HDFS JIRA.
** Deploying ZooKeeper
In a typical deployment, ZooKeeper daemons are configured to run on three or
five nodes. Since ZooKeeper itself has light resource requirements, it is
acceptable to collocate the ZooKeeper nodes on the same hardware as the HDFS
NameNode and Standby Node. Many operators choose to deploy the third ZooKeeper
process on the same node as the YARN ResourceManager. It is advisable to
configure the ZooKeeper nodes to store their data on separate disk drives from
the HDFS metadata for best performance and isolation.
The setup of ZooKeeper is out of scope for this document. We will assume that
you have set up a ZooKeeper cluster running on three or more nodes, and have
verified its correct operation by connecting using the ZK CLI.
** Before you begin
Before you begin configuring automatic failover, you should shut down your
cluster. It is not currently possible to transition from a manual failover
setup to an automatic failover setup while the cluster is running.
** Configuring automatic failover
The configuration of automatic failover requires the addition of two new
parameters to your configuration. In your <<<hdfs-site.xml>>> file, add:
----
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
----
This specifies that the cluster should be set up for automatic failover.
In your <<<core-site.xml>>> file, add:
----
<property>
<name>ha.zookeeper.quorum</name>
<value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
----
This lists the host-port pairs running the ZooKeeper service.
As with the parameters described earlier in the document, these settings may
be configured on a per-nameservice basis by suffixing the configuration key
with the nameservice ID. For example, in a cluster with federation enabled,
you can explicitly enable automatic failover for only one of the nameservices
by setting <<<dfs.ha.automatic-failover.enabled.my-nameservice-id>>>.
There are also several other configuration parameters which may be set to
control the behavior of automatic failover; however, they are not necessary
for most installations. Please refer to the configuration key specific
documentation for details.
** Initializing HA state in ZooKeeper
After the configuration keys have been added, the next step is to initialize
required state in ZooKeeper. You can do so by running the following command
from one of the NameNode hosts.
----
[hdfs]$ $HADOOP_PREFIX/bin/hdfs zkfc -formatZK
----
This will create a znode in ZooKeeper inside of which the automatic failover
system stores its data.
** Starting the cluster with <<<start-dfs.sh>>>
Since automatic failover has been enabled in the configuration, the
<<<start-dfs.sh>>> script will now automatically start a ZKFC daemon on any
machine that runs a NameNode. When the ZKFCs start, they will automatically
select one of the NameNodes to become active.
** Starting the cluster manually
If you manually manage the services on your cluster, you will need to manually
start the <<<zkfc>>> daemon on each of the machines that runs a NameNode. You
can start the daemon by running:
----
[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon start zkfc
----
** Securing access to ZooKeeper
If you are running a secure cluster, you will likely want to ensure that the
information stored in ZooKeeper is also secured. This prevents malicious
clients from modifying the metadata in ZooKeeper or potentially triggering a
false failover.
In order to secure the information in ZooKeeper, first add the following to
your <<<core-site.xml>>> file:
----
<property>
<name>ha.zookeeper.auth</name>
<value>@/path/to/zk-auth.txt</value>
</property>
<property>
<name>ha.zookeeper.acl</name>
<value>@/path/to/zk-acl.txt</value>
</property>
----
Please note the '@' character in these values -- this specifies that the
configurations are not inline, but rather point to a file on disk.
The first configured file specifies a list of ZooKeeper authentications, in
the same format as used by the ZK CLI. For example, you may specify something
like:
----
digest:hdfs-zkfcs:mypassword
----
...where <<<hdfs-zkfcs>>> is a unique username for ZooKeeper, and
<<<mypassword>>> is some unique string used as a password.
Next, generate a ZooKeeper ACL that corresponds to this authentication, using
a command like the following:
----
[hdfs]$ java -cp $ZK_HOME/lib/*:$ZK_HOME/zookeeper-3.4.2.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider hdfs-zkfcs:mypassword
output: hdfs-zkfcs:mypassword->hdfs-zkfcs:P/OQvnYyU/nF/mGYvB/xurX8dYs=
----
Copy and paste the section of this output after the '->' string into the file
<<<zk-acl.txt>>>, prefixed by the string "<<<digest:>>>". For example:
----
digest:hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=:rwcda
----
In order for these ACLs to take effect, you should then rerun the
<<<zkfc -formatZK>>> command as described above.
After doing so, you may verify the ACLs from the ZK CLI as follows:
----
[zk: localhost:2181(CONNECTED) 1] getAcl /hadoop-ha
'digest,'hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=
: cdrwa
----
** Verifying automatic failover
Once automatic failover has been set up, you should test its operation. To do
so, first locate the active NameNode. You can tell which node is active by
visiting the NameNode web interfaces -- each node reports its HA state at the
top of the page.
Once you have located your active NameNode, you may cause a failure on that
node. For example, you can use <<<kill -9 <pid of NN>>>> to simulate a JVM
crash. Or, you could power cycle the machine or unplug its network interface
to simulate a different kind of outage. After triggering the outage you wish
to test, the other NameNode should automatically become active within several
seconds. The amount of time required to detect a failure and trigger a
fail-over depends on the configuration of
<<<ha.zookeeper.session-timeout.ms>>>, but defaults to 5 seconds.
If the test does not succeed, you may have a misconfiguration. Check the logs
for the <<<zkfc>>> daemons as well as the NameNode daemons in order to further
diagnose the issue.
* Automatic Failover FAQ
* <<Is it important that I start the ZKFC and NameNode daemons in any
particular order?>>
No. On any given node you may start the ZKFC before or after its corresponding
NameNode.
* <<What additional monitoring should I put in place?>>
You should add monitoring on each host that runs a NameNode to ensure that the
ZKFC remains running. In some types of ZooKeeper failures, for example, the
ZKFC may unexpectedly exit, and should be restarted to ensure that the system
is ready for automatic failover.
Additionally, you should monitor each of the servers in the ZooKeeper
quorum. If ZooKeeper crashes, then automatic failover will not function.
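As a minimal sketch of such a liveness check (assuming <<<jps>>> is available on
the host; the check itself is not prescribed by HDFS), one could verify that the
ZKFC process is still running:

----
# Exits non-zero if no DFSZKFailoverController process is found.
$ jps | grep DFSZKFailoverController
----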
* <<What happens if ZooKeeper goes down?>>
If the ZooKeeper cluster crashes, no automatic failovers will be triggered.
However, HDFS will continue to run without any impact. When ZooKeeper is
restarted, HDFS will reconnect with no issues.
* <<Can I designate one of my NameNodes as primary/preferred?>>
No. Currently, this is not supported. Whichever NameNode is started first will
become active. You may choose to start the cluster in a specific order such
that your preferred node starts first.
* <<How can I initiate a manual failover when automatic failover is
configured?>>
Even if automatic failover is configured, you may initiate a manual failover
using the same <<<hdfs haadmin>>> command. It will perform a coordinated
failover.
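As a sketch, assuming two NameNodes with the IDs <<<nn1>>> and <<<nn2>>>
(hypothetical IDs used only for illustration), a coordinated manual failover
could be requested like this:

----
# Ask the HA framework to make nn2 the active NameNode.
[hdfs]$ $HADOOP_PREFIX/bin/hdfs haadmin -failover nn1 nn2
----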
* BookKeeper as a Shared storage (EXPERIMENTAL)
One option for shared storage for the NameNode is BookKeeper.
BookKeeper achieves high availability and strong durability guarantees by replicating
edit log entries across multiple storage nodes. The edit log can be striped across
the storage nodes for high performance. Fencing is supported in the protocol, i.e.,
BookKeeper will not allow two writers to write to the same edit log.
The metadata for BookKeeper is stored in ZooKeeper.
In the current HA architecture, a ZooKeeper cluster is required for the ZKFC. The same
cluster can be used for the BookKeeper metadata.
For more details on building a BookKeeper cluster, please refer to the
{{{http://zookeeper.apache.org/bookkeeper/docs/trunk/bookkeeperConfig.html }BookKeeper documentation}}.
The BookKeeperJournalManager is an implementation of the HDFS JournalManager interface, which allows custom write ahead logging implementations to be plugged into the HDFS NameNode.
**<<BookKeeper Journal Manager>>
To use BookKeeperJournalManager, add the following to hdfs-site.xml.
----
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>bookkeeper://zk1:2181;zk2:2181;zk3:2181/hdfsjournal</value>
</property>
<property>
<name>dfs.namenode.edits.journal-plugin.bookkeeper</name>
<value>org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager</value>
</property>
----
The URI format for bookkeeper is <<<bookkeeper://[zkEnsemble]/[rootZnode]>>>, where
<<<[zkEnsemble]>>> is a list of semicolon-separated ZooKeeper host:port
pairs. In the example above there are 3 servers in the ensemble,
zk1, zk2 & zk3, each one listening on port 2181.
<<<[rootZnode]>>> is the path of the ZooKeeper znode under which the edit log
information will be stored.
The class specified for the journal-plugin must be available in the NameNode's
classpath. We explain how to generate a jar file with the journal manager and
its dependencies, and how to put it into the classpath below.
*** <<More configuration options>>
* <<dfs.namenode.bookkeeperjournal.output-buffer-size>> -
Number of bytes a bookkeeper journal stream will buffer before
forcing a flush. Default is 1024.
----
<property>
<name>dfs.namenode.bookkeeperjournal.output-buffer-size</name>
<value>1024</value>
</property>
----
* <<dfs.namenode.bookkeeperjournal.ensemble-size>> -
Number of bookkeeper servers in edit log ensembles. This
is the number of bookkeeper servers which need to be available
for the edit log to be writable. Default is 3.
----
<property>
<name>dfs.namenode.bookkeeperjournal.ensemble-size</name>
<value>3</value>
</property>
----
* <<dfs.namenode.bookkeeperjournal.quorum-size>> -
Number of bookkeeper servers in the write quorum. This is the
number of bookkeeper servers which must have acknowledged the
write of an entry before it is considered written. Default is 2.
----
<property>
<name>dfs.namenode.bookkeeperjournal.quorum-size</name>
<value>2</value>
</property>
----
* <<dfs.namenode.bookkeeperjournal.digestPw>> -
Password to use when creating edit log segments.
----
<property>
<name>dfs.namenode.bookkeeperjournal.digestPw</name>
<value>myPassword</value>
</property>
----
* <<dfs.namenode.bookkeeperjournal.zk.session.timeout>> -
Session timeout for the ZooKeeper client used by the BookKeeper Journal Manager.
It is recommended that this value be less than the ZKFC
session timeout value. Default value is 3000.
----
<property>
<name>dfs.namenode.bookkeeperjournal.zk.session.timeout</name>
<value>3000</value>
</property>
----
*** <<Building BookKeeper Journal Manager plugin jar>>
To generate the distribution packages for the BK journal, do the
following:
$ mvn clean package -Pdist
This will generate a jar containing the BookKeeperJournalManager at
hadoop-hdfs/src/contrib/bkjournal/target/hadoop-hdfs-bkjournal-<VERSION>.jar.
Note that the -Pdist part of the build command is important; it copies
the dependent bookkeeper-server jar under
hadoop-hdfs/src/contrib/bkjournal/target/lib.
*** <<Putting the BookKeeperJournalManager in the NameNode classpath>>
To run an HDFS NameNode using BookKeeper as a backend, copy the bkjournal and
bookkeeper-server jars, mentioned above, into the lib directory of HDFS. In the
standard distribution of HDFS, this is at $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/:
cp hadoop-hdfs/src/contrib/bkjournal/target/hadoop-hdfs-bkjournal-<VERSION>.jar $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/
*** <<Current limitations>>
1) Security in BookKeeper. BookKeeper does not support SASL or SSL for
connections between the NameNode and BookKeeper storage nodes.
View File
@ -1,816 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Distributed File System-${project.version} - High Availability
---
---
${maven.build.timestamp}
HDFS High Availability Using the Quorum Journal Manager
%{toc|section=1|fromDepth=0}
* {Purpose}
This guide provides an overview of the HDFS High Availability (HA) feature
and how to configure and manage an HA HDFS cluster, using the Quorum Journal
Manager (QJM) feature.
This document assumes that the reader has a general understanding of
the components and node types in an HDFS cluster. Please refer to the
HDFS Architecture guide for details.
* {Note: Using the Quorum Journal Manager or Conventional Shared Storage}
This guide discusses how to configure and use HDFS HA using the Quorum
Journal Manager (QJM) to share edit logs between the Active and Standby
NameNodes. For information on how to configure HDFS HA using NFS for shared
storage instead of the QJM, please see
{{{./HDFSHighAvailabilityWithNFS.html}this alternative guide.}}
* {Background}
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in
an HDFS cluster. Each cluster had a single NameNode, and if that machine or
process became unavailable, the cluster as a whole would be unavailable
until the NameNode was either restarted or brought up on a separate machine.
This impacted the total availability of the HDFS cluster in two major ways:
* In the case of an unplanned event such as a machine crash, the cluster would
be unavailable until an operator restarted the NameNode.
* Planned maintenance events such as software or hardware upgrades on the
NameNode machine would result in windows of cluster downtime.
The HDFS High Availability feature addresses the above problems by providing
the option of running two redundant NameNodes in the same cluster in an
Active/Passive configuration with a hot standby. This allows a fast failover to
a new NameNode in the case that a machine crashes, or a graceful
administrator-initiated failover for the purpose of planned maintenance.
* {Architecture}
In a typical HA cluster, two separate machines are configured as NameNodes.
At any point in time, exactly one of the NameNodes is in an <Active> state,
and the other is in a <Standby> state. The Active NameNode is responsible
for all client operations in the cluster, while the Standby is simply acting
as a slave, maintaining enough state to provide a fast failover if
necessary.
In order for the Standby node to keep its state synchronized with the Active
node, both nodes communicate with a group of separate daemons called
"JournalNodes" (JNs). When any namespace modification is performed by the
Active node, it durably logs a record of the modification to a majority of
these JNs. The Standby node is capable of reading the edits from the JNs, and
is constantly watching them for changes to the edit log. As the Standby Node
sees the edits, it applies them to its own namespace. In the event of a
failover, the Standby will ensure that it has read all of the edits from the
JournalNodes before promoting itself to the Active state. This ensures that the
namespace state is fully synchronized before a failover occurs.
In order to provide a fast failover, it is also necessary that the Standby node
have up-to-date information regarding the location of blocks in the cluster.
In order to achieve this, the DataNodes are configured with the location of
both NameNodes, and send block location information and heartbeats to both.
It is vital for the correct operation of an HA cluster that only one of the
NameNodes be Active at a time. Otherwise, the namespace state would quickly
diverge between the two, risking data loss or other incorrect results. In
order to ensure this property and prevent the so-called "split-brain scenario,"
the JournalNodes will only ever allow a single NameNode to be a writer at a
time. During a failover, the NameNode which is to become active will simply
take over the role of writing to the JournalNodes, which will effectively
prevent the other NameNode from continuing in the Active state, allowing the
new Active to safely proceed with failover.
* {Hardware resources}
In order to deploy an HA cluster, you should prepare the following:
* <<NameNode machines>> - the machines on which you run the Active and
Standby NameNodes should have equivalent hardware to each other, and
equivalent hardware to what would be used in a non-HA cluster.
* <<JournalNode machines>> - the machines on which you run the JournalNodes.
The JournalNode daemon is relatively lightweight, so these daemons may
reasonably be collocated on machines with other Hadoop daemons, for example
NameNodes, the JobTracker, or the YARN ResourceManager. <<Note:>> There
must be at least 3 JournalNode daemons, since edit log modifications must be
written to a majority of JNs. This will allow the system to tolerate the
failure of a single machine. You may also run more than 3 JournalNodes, but
in order to actually increase the number of failures the system can tolerate,
you should run an odd number of JNs (i.e. 3, 5, 7, etc.). Note that when
running with N JournalNodes, the system can tolerate at most (N - 1) / 2
failures and continue to function normally.
Note that, in an HA cluster, the Standby NameNode also performs checkpoints of
the namespace state, and thus it is not necessary to run a Secondary NameNode,
CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an
error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster
to be HA-enabled to reuse the hardware which they had previously dedicated to
the Secondary NameNode.
* {Deployment}
** Configuration overview
Similar to Federation configuration, HA configuration is backward compatible
and allows existing single NameNode configurations to work without change.
The new configuration is designed such that all the nodes in the cluster may
have the same configuration without the need for deploying different
configuration files to different machines based on the type of the node.
Like HDFS Federation, HA clusters reuse the <<<nameservice ID>>> to identify a
single HDFS instance that may in fact consist of multiple HA NameNodes. In
addition, a new abstraction called <<<NameNode ID>>> is added with HA. Each
distinct NameNode in the cluster has a different NameNode ID to distinguish it.
To support a single configuration file for all of the NameNodes, the relevant
configuration parameters are suffixed with the <<nameservice ID>> as well as
the <<NameNode ID>>.
** Configuration details
To configure HA NameNodes, you must add several configuration options to your
<<hdfs-site.xml>> configuration file.
The order in which you set these configurations is unimportant, but the values
you choose for <<dfs.nameservices>> and
<<dfs.ha.namenodes.[nameservice ID]>> will determine the keys of those that
follow. Thus, you should decide on these values before setting the rest of the
configuration options.
* <<dfs.nameservices>> - the logical name for this new nameservice
Choose a logical name for this nameservice, for example "mycluster", and use
this logical name for the value of this config option. The name you choose is
arbitrary. It will be used both for configuration and as the authority
component of absolute HDFS paths in the cluster.
<<Note:>> If you are also using HDFS Federation, this configuration setting
should also include the list of other nameservices, HA or otherwise, as a
comma-separated list.
----
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
----
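If HDFS Federation is also in use, the value would instead list every
nameservice in the cluster. A sketch with a second, hypothetical nameservice
named "yourcluster":

----
<property>
  <name>dfs.nameservices</name>
  <value>mycluster,yourcluster</value>
</property>
----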
* <<dfs.ha.namenodes.[nameservice ID]>> - unique identifiers for each NameNode in the nameservice
Configure with a list of comma-separated NameNode IDs. This will be used by
DataNodes to determine all the NameNodes in the cluster. For example, if you
used "mycluster" as the nameservice ID previously, and you wanted to use "nn1"
and "nn2" as the individual IDs of the NameNodes, you would configure this as
such:
----
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
----
<<Note:>> Currently, only a maximum of two NameNodes may be configured per
nameservice.
* <<dfs.namenode.rpc-address.[nameservice ID].[name node ID]>> - the fully-qualified RPC address for each NameNode to listen on
For both of the previously-configured NameNode IDs, set the full address and
IPC port of the NameNode process. Note that this results in two separate
configuration options. For example:
----
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>machine1.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>machine2.example.com:8020</value>
</property>
----
<<Note:>> You may similarly configure the "<<servicerpc-address>>" setting if
you so desire.
* <<dfs.namenode.http-address.[nameservice ID].[name node ID]>> - the fully-qualified HTTP address for each NameNode to listen on
Similarly to <rpc-address> above, set the addresses for both NameNodes' HTTP
servers to listen on. For example:
----
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>machine1.example.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>machine2.example.com:50070</value>
</property>
----
<<Note:>> If you have Hadoop's security features enabled, you should also set
the <https-address> similarly for each NameNode.
* <<dfs.namenode.shared.edits.dir>> - the URI which identifies the group of JNs where the NameNodes will write/read edits
This is where one configures the addresses of the JournalNodes which provide
the shared edits storage, written to by the Active NameNode and read by the
Standby NameNode to stay up-to-date with all the file system changes the Active
NameNode makes. Though you must specify several JournalNode addresses,
<<you should only configure one of these URIs.>> The URI should be of the form:
"qjournal://<host1:port1>;<host2:port2>;<host3:port3>/<journalId>". The Journal
ID is a unique identifier for this nameservice, which allows a single set of
JournalNodes to provide storage for multiple federated namesystems. Though not
a requirement, it's a good idea to reuse the nameservice ID for the journal
identifier.
For example, if the JournalNodes for this cluster were running on the
machines "node1.example.com", "node2.example.com", and "node3.example.com" and
the nameservice ID were "mycluster", you would use the following as the value
for this setting (the default port for the JournalNode is 8485):
----
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
</property>
----
* <<dfs.client.failover.proxy.provider.[nameservice ID]>> - the Java class that HDFS clients use to contact the Active NameNode
Configure the name of the Java class which will be used by the DFS Client to
determine which NameNode is the current Active, and therefore which NameNode is
currently serving client requests. The only implementation which currently
ships with Hadoop is the <<ConfiguredFailoverProxyProvider>>, so use this
unless you are using a custom one. For example:
----
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
----
* <<dfs.ha.fencing.methods>> - a list of scripts or Java classes which will be used to fence the Active NameNode during a failover
It is desirable for correctness of the system that only one NameNode be in
the Active state at any given time. <<Importantly, when using the Quorum
Journal Manager, only one NameNode will ever be allowed to write to the
JournalNodes, so there is no potential for corrupting the file system metadata
from a split-brain scenario.>> However, when a failover occurs, it is still
possible that the previous Active NameNode could serve read requests to
clients, which may be out of date until that NameNode shuts down when trying to
write to the JournalNodes. For this reason, it is still desirable to configure
some fencing methods even when using the Quorum Journal Manager. However, to
improve the availability of the system in the event the fencing mechanisms
fail, it is advisable to configure a fencing method which is guaranteed to
return success as the last fencing method in the list. Note that if you choose
to use no actual fencing methods, you still must configure something for this
setting, for example "<<<shell(/bin/true)>>>".
The fencing methods used during a failover are configured as a
carriage-return-separated list, which will be attempted in order until one
indicates that fencing has succeeded. There are two methods which ship with
Hadoop: <shell> and <sshfence>. For information on implementing your own custom
fencing method, see the <org.apache.hadoop.ha.NodeFencer> class.
* <<sshfence>> - SSH to the Active NameNode and kill the process
The <sshfence> option SSHes to the target node and uses <fuser> to kill the
process listening on the service's TCP port. In order for this fencing option
to work, it must be able to SSH to the target node without providing a
passphrase. Thus, one must also configure the
<<dfs.ha.fencing.ssh.private-key-files>> option, which is a
comma-separated list of SSH private key files. For example:
---
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/exampleuser/.ssh/id_rsa</value>
</property>
---
Optionally, one may configure a non-standard username or port to perform the
SSH. One may also configure a timeout, in milliseconds, for the SSH, after
which this fencing method will be considered to have failed. It may be
configured like so:
---
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence([[username][:port]])</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>
---
* <<shell>> - run an arbitrary shell command to fence the Active NameNode
The <shell> fencing method runs an arbitrary shell command. It may be
configured like so:
---
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
</property>
---
The string between '(' and ')' is passed directly to a bash shell and may not
include any closing parentheses.
The shell command will be run with an environment set up to contain all of the
current Hadoop configuration variables, with the '_' character replacing any
'.' characters in the configuration keys. The configuration used has already had
any namenode-specific configurations promoted to their generic forms -- for example
<<dfs_namenode_rpc-address>> will contain the RPC address of the target node, even
though the configuration may specify that variable as
<<dfs.namenode.rpc-address.ns1.nn1>>.
Additionally, the following variables referring to the target node to be fenced
are also available:
*-----------------------:-----------------------------------+
| $target_host | hostname of the node to be fenced |
*-----------------------:-----------------------------------+
| $target_port | IPC port of the node to be fenced |
*-----------------------:-----------------------------------+
| $target_address | the above two, combined as host:port |
*-----------------------:-----------------------------------+
| $target_nameserviceid | the nameservice ID of the NN to be fenced |
*-----------------------:-----------------------------------+
| $target_namenodeid | the namenode ID of the NN to be fenced |
*-----------------------:-----------------------------------+
These environment variables may also be used as substitutions in the shell
command itself. For example:
---
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/path/to/my/script.sh --nameservice=$target_nameserviceid $target_host:$target_port)</value>
</property>
---
If the shell command returns an exit
code of 0, the fencing is determined to be successful. If it returns any other
exit code, the fencing was not successful and the next fencing method in the
list will be attempted.
<<Note:>> This fencing method does not implement any timeout. If timeouts are
necessary, they should be implemented in the shell script itself (eg by forking
a subshell to kill its parent in some number of seconds).
* <<fs.defaultFS>> - the default path prefix used by the Hadoop FS client when none is given
Optionally, you may now configure the default path for Hadoop clients to use
the new HA-enabled logical URI. If you used "mycluster" as the nameservice ID
earlier, this will be the value of the authority portion of all of your HDFS
paths. This may be configured like so, in your <<core-site.xml>> file:
---
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
---
* <<dfs.journalnode.edits.dir>> - the path where the JournalNode daemon will store its local state
This is the absolute path on the JournalNode machines where the edits and
other local state used by the JNs will be stored. You may only use a single
path for this configuration. Redundancy for this data is provided by running
multiple separate JournalNodes, or by configuring this directory on a
locally-attached RAID array. For example:
---
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/path/to/journal/node/local/data</value>
</property>
---
** Deployment details
After all of the necessary configuration options have been set, you must
start the JournalNode daemons on the set of machines where they will run. This
can be done by running the command "<hadoop-daemon.sh start journalnode>" and
waiting for the daemon to start on each of the relevant machines.
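For example, on each JournalNode machine (a minimal sketch; the command is the
one named above, assuming the usual <<<$HADOOP_PREFIX>>> installation layout):

----
# Run on every machine configured in dfs.namenode.shared.edits.dir.
[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh start journalnode
----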
Once the JournalNodes have been started, one must initially synchronize the
two HA NameNodes' on-disk metadata.
* If you are setting up a fresh HDFS cluster, you should first run the format
command (<hdfs namenode -format>) on one of NameNodes.
* If you have already formatted the NameNode, or are converting a
non-HA-enabled cluster to be HA-enabled, you should now copy over the
contents of your NameNode metadata directories to the other, unformatted
NameNode by running the command "<hdfs namenode -bootstrapStandby>" on the
unformatted NameNode. Running this command will also ensure that the
JournalNodes (as configured by <<dfs.namenode.shared.edits.dir>>) contain
sufficient edits transactions to be able to start both NameNodes.
* If you are converting a non-HA NameNode to be HA, you should run the
command "<hdfs namenode -initializeSharedEdits>", which will initialize the
JournalNodes with the edits data from the local NameNode edits directories
(see the command sketch after this list).
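The three cases above map onto the following commands; a minimal sketch,
assuming the same <<<$HADOOP_PREFIX>>> layout used elsewhere in this document:

----
# Fresh cluster only: format the namespace on one of the NameNodes.
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -format

# On the other, unformatted NameNode (with the formatted one running):
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -bootstrapStandby

# Only when converting an existing non-HA NameNode to HA:
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -initializeSharedEdits
----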
At this point you may start both of your HA NameNodes as you normally would
start a NameNode.
You can visit each of the NameNodes' web pages separately by browsing to their
configured HTTP addresses. You should notice that next to the configured
address will be the HA state of the NameNode (either "standby" or "active").
Whenever an HA NameNode starts, it is initially in the Standby state.
** Administrative commands
Now that your HA NameNodes are configured and started, you will have access
to some additional commands to administer your HA HDFS cluster. Specifically,
you should familiarize yourself with all of the subcommands of the "<hdfs
haadmin>" command. Running this command without any additional arguments will
display the following usage information:
---
Usage: DFSHAAdmin [-ns <nameserviceId>]
[-transitionToActive <serviceId>]
[-transitionToStandby <serviceId>]
[-failover [--forcefence] [--forceactive] <serviceId> <serviceId>]
[-getServiceState <serviceId>]
[-checkHealth <serviceId>]
[-help <command>]
---
This guide describes high-level uses of each of these subcommands. For
specific usage information of each subcommand, you should run "<hdfs haadmin
-help <command>>".
* <<transitionToActive>> and <<transitionToStandby>> - transition the state of the given NameNode to Active or Standby
These subcommands cause a given NameNode to transition to the Active or Standby
state, respectively. <<These commands do not attempt to perform any fencing,
and thus should rarely be used.>> Instead, one should almost always prefer to
use the "<hdfs haadmin -failover>" subcommand.
* <<failover>> - initiate a failover between two NameNodes
This subcommand causes a failover from the first provided NameNode to the
second. If the first NameNode is in the Standby state, this command simply
transitions the second to the Active state without error. If the first NameNode
is in the Active state, an attempt will be made to gracefully transition it to
the Standby state. If this fails, the fencing methods (as configured by
<<dfs.ha.fencing.methods>>) will be attempted in order until one
succeeds. Only after this process will the second NameNode be transitioned to
the Active state. If no fencing method succeeds, the second NameNode will not
be transitioned to the Active state, and an error will be returned.
* <<getServiceState>> - determine whether the given NameNode is Active or Standby
Connect to the provided NameNode to determine its current state, printing
either "standby" or "active" to STDOUT appropriately. This subcommand might be
used by cron jobs or monitoring scripts which need to behave differently based
on whether the NameNode is currently Active or Standby.
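For example (a sketch, assuming the NameNode ID <<<nn1>>> configured earlier in
this document):

----
[hdfs]$ $HADOOP_PREFIX/bin/hdfs haadmin -getServiceState nn1
active
----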
* <<checkHealth>> - check the health of the given NameNode
Connect to the provided NameNode to check its health. The NameNode is capable
of performing some diagnostics on itself, including checking if internal
services are running as expected. This command will return 0 if the NameNode is
healthy, non-zero otherwise. One might use this command for monitoring
purposes.
<<Note:>> This is not yet implemented, and at present will always return
success, unless the given NameNode is completely down.
* {Automatic Failover}
** Introduction
The above sections describe how to configure manual failover. In that mode,
the system will not automatically trigger a failover from the active to the
standby NameNode, even if the active node has failed. This section describes
how to configure and deploy automatic failover.
** Components
Automatic failover adds two new components to an HDFS deployment: a ZooKeeper
quorum, and the ZKFailoverController process (abbreviated as ZKFC).
Apache ZooKeeper is a highly available service for maintaining small amounts
of coordination data, notifying clients of changes in that data, and
monitoring clients for failures. The implementation of automatic HDFS failover
relies on ZooKeeper for the following things:
* <<Failure detection>> - each of the NameNode machines in the cluster
maintains a persistent session in ZooKeeper. If the machine crashes, the
ZooKeeper session will expire, notifying the other NameNode that a failover
should be triggered.
* <<Active NameNode election>> - ZooKeeper provides a simple mechanism to
exclusively elect a node as active. If the current active NameNode crashes,
another node may take a special exclusive lock in ZooKeeper indicating that
it should become the next active.
The ZKFailoverController (ZKFC) is a new component which is a ZooKeeper client
which also monitors and manages the state of the NameNode. Each of the
machines which runs a NameNode also runs a ZKFC, and that ZKFC is responsible
for:
* <<Health monitoring>> - the ZKFC pings its local NameNode on a periodic
basis with a health-check command. So long as the NameNode responds in a
timely fashion with a healthy status, the ZKFC considers the node
healthy. If the node has crashed, frozen, or otherwise entered an unhealthy
state, the health monitor will mark it as unhealthy.
* <<ZooKeeper session management>> - when the local NameNode is healthy, the
ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it
also holds a special "lock" znode. This lock uses ZooKeeper's support for
"ephemeral" nodes; if the session expires, the lock node will be
automatically deleted.
* <<ZooKeeper-based election>> - if the local NameNode is healthy, and the
ZKFC sees that no other node currently holds the lock znode, it will itself
try to acquire the lock. If it succeeds, then it has "won the election", and
is responsible for running a failover to make its local NameNode active. The
failover process is similar to the manual failover described above: first,
the previous active is fenced if necessary, and then the local NameNode
transitions to active state.
For more details on the design of automatic failover, refer to the design
document attached to HDFS-2185 on the Apache HDFS JIRA.
** Deploying ZooKeeper
In a typical deployment, ZooKeeper daemons are configured to run on three or
five nodes. Since ZooKeeper itself has light resource requirements, it is
acceptable to collocate the ZooKeeper nodes on the same hardware as the HDFS
NameNode and Standby Node. Many operators choose to deploy the third ZooKeeper
process on the same node as the YARN ResourceManager. It is advisable to
configure the ZooKeeper nodes to store their data on separate disk drives from
the HDFS metadata for best performance and isolation.
The setup of ZooKeeper is out of scope for this document. We will assume that
you have set up a ZooKeeper cluster running on three or more nodes, and have
verified its correct operation by connecting using the ZK CLI.
** Before you begin
Before you begin configuring automatic failover, you should shut down your
cluster. It is not currently possible to transition from a manual failover
setup to an automatic failover setup while the cluster is running.
** Configuring automatic failover
The configuration of automatic failover requires the addition of two new
parameters to your configuration. In your <<<hdfs-site.xml>>> file, add:
----
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
----
This specifies that the cluster should be set up for automatic failover.
In your <<<core-site.xml>>> file, add:
----
<property>
<name>ha.zookeeper.quorum</name>
<value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
----
This lists the host-port pairs running the ZooKeeper service.
As with the parameters described earlier in the document, these settings may
be configured on a per-nameservice basis by suffixing the configuration key
with the nameservice ID. For example, in a cluster with federation enabled,
you can explicitly enable automatic failover for only one of the nameservices
by setting <<<dfs.ha.automatic-failover.enabled.my-nameservice-id>>>.
There are also several other configuration parameters which may be set to
control the behavior of automatic failover; however, they are not necessary
for most installations. Please refer to the configuration key specific
documentation for details.
** Initializing HA state in ZooKeeper
After the configuration keys have been added, the next step is to initialize
required state in ZooKeeper. You can do so by running the following command
from one of the NameNode hosts.
----
[hdfs]$ $HADOOP_PREFIX/bin/hdfs zkfc -formatZK
----
This will create a znode in ZooKeeper inside of which the automatic failover
system stores its data.
** Starting the cluster with <<<start-dfs.sh>>>
Since automatic failover has been enabled in the configuration, the
<<<start-dfs.sh>>> script will now automatically start a ZKFC daemon on any
machine that runs a NameNode. When the ZKFCs start, they will automatically
select one of the NameNodes to become active.
** Starting the cluster manually
If you manually manage the services on your cluster, you will need to manually
start the <<<zkfc>>> daemon on each of the machines that runs a NameNode. You
can start the daemon by running:
----
[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon start zkfc
----
** Securing access to ZooKeeper
If you are running a secure cluster, you will likely want to ensure that the
information stored in ZooKeeper is also secured. This prevents malicious
clients from modifying the metadata in ZooKeeper or potentially triggering a
false failover.
In order to secure the information in ZooKeeper, first add the following to
your <<<core-site.xml>>> file:
----
<property>
<name>ha.zookeeper.auth</name>
<value>@/path/to/zk-auth.txt</value>
</property>
<property>
<name>ha.zookeeper.acl</name>
<value>@/path/to/zk-acl.txt</value>
</property>
----
Please note the '@' character in these values -- this specifies that the
configurations are not inline, but rather point to a file on disk.
The first configured file specifies a list of ZooKeeper authentications, in
the same format as used by the ZK CLI. For example, you may specify something
like:
----
digest:hdfs-zkfcs:mypassword
----
...where <<<hdfs-zkfcs>>> is a unique username for ZooKeeper, and
<<<mypassword>>> is some unique string used as a password.
Next, generate a ZooKeeper ACL that corresponds to this authentication, using
a command like the following:
----
[hdfs]$ java -cp $ZK_HOME/lib/*:$ZK_HOME/zookeeper-3.4.2.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider hdfs-zkfcs:mypassword
output: hdfs-zkfcs:mypassword->hdfs-zkfcs:P/OQvnYyU/nF/mGYvB/xurX8dYs=
----
Copy and paste the section of this output after the '->' string into the file
<<<zk-acl.txt>>>, prefixed by the string "<<<digest:>>>". For example:
----
digest:hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=:rwcda
----
In order for these ACLs to take effect, you should then rerun the
<<<zkfc -formatZK>>> command as described above.
After doing so, you may verify the ACLs from the ZK CLI as follows:
----
[zk: localhost:2181(CONNECTED) 1] getAcl /hadoop-ha
'digest,'hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=
: cdrwa
----
** Verifying automatic failover
Once automatic failover has been set up, you should test its operation. To do
so, first locate the active NameNode. You can tell which node is active by
visiting the NameNode web interfaces -- each node reports its HA state at the
top of the page.
Once you have located your active NameNode, you may cause a failure on that
node. For example, you can use <<<kill -9 <pid of NN>>>> to simulate a JVM
crash. Or, you could power cycle the machine or unplug its network interface
to simulate a different kind of outage. After triggering the outage you wish
to test, the other NameNode should automatically become active within several
seconds. The amount of time required to detect a failure and trigger a
fail-over depends on the configuration of
<<<ha.zookeeper.session-timeout.ms>>>, but defaults to 5 seconds.
If the test does not succeed, you may have a misconfiguration. Check the logs
for the <<<zkfc>>> daemons as well as the NameNode daemons in order to further
diagnose the issue.
* Automatic Failover FAQ
* <<Is it important that I start the ZKFC and NameNode daemons in any
particular order?>>
No. On any given node you may start the ZKFC before or after its corresponding
NameNode.
* <<What additional monitoring should I put in place?>>
You should add monitoring on each host that runs a NameNode to ensure that the
ZKFC remains running. In some types of ZooKeeper failures, for example, the
ZKFC may unexpectedly exit, and should be restarted to ensure that the system
is ready for automatic failover.
Additionally, you should monitor each of the servers in the ZooKeeper
quorum. If ZooKeeper crashes, then automatic failover will not function.
* <<What happens if ZooKeeper goes down?>>
If the ZooKeeper cluster crashes, no automatic failovers will be triggered.
However, HDFS will continue to run without any impact. When ZooKeeper is
restarted, HDFS will reconnect with no issues.
* <<Can I designate one of my NameNodes as primary/preferred?>>
No. Currently, this is not supported. Whichever NameNode is started first will
become active. You may choose to start the cluster in a specific order such
that your preferred node starts first.
* <<How can I initiate a manual failover when automatic failover is
configured?>>
Even if automatic failover is configured, you may initiate a manual failover
using the same <<<hdfs haadmin>>> command. It will perform a coordinated
failover.
* HDFS Upgrade/Finalization/Rollback with HA Enabled
When moving between versions of HDFS, sometimes the newer software can simply
be installed and the cluster restarted. Sometimes, however, upgrading the
version of HDFS you're running may require changing on-disk data. In this case,
one must use the HDFS Upgrade/Finalize/Rollback facility after installing the
new software. This process is made more complex in an HA environment, since the
on-disk metadata that the NN relies upon is by definition distributed, both on
the two HA NNs in the pair, and on the JournalNodes in the case that QJM is
being used for the shared edits storage. This documentation section describes
the procedure to use the HDFS Upgrade/Finalize/Rollback facility in an HA setup.
<<To perform an HA upgrade>>, the operator must do the following:
[[1]] Shut down all of the NNs as normal, and install the newer software.
[[2]] Start up all of the JNs. Note that it is <<critical>> that all the
JNs be running when performing the upgrade, rollback, or finalization
operations. If any of the JNs are down at the time of running any of these
operations, the operation will fail.
[[3]] Start one of the NNs with the <<<'-upgrade'>>> flag.
[[4]] On start, this NN will not enter the standby state as usual in an HA
setup. Rather, this NN will immediately enter the active state, perform an
upgrade of its local storage dirs, and also perform an upgrade of the shared
edit log.
[[5]] At this point the other NN in the HA pair will be out of sync with
the upgraded NN. In order to bring it back in sync and once again have a highly
available setup, you should re-bootstrap this NameNode by running the NN with
the <<<'-bootstrapStandby'>>> flag. It is an error to start this second NN with
the <<<'-upgrade'>>> flag.
Note that if at any time you want to restart the NameNodes before finalizing
or rolling back the upgrade, you should start the NNs as normal, i.e. without
any special startup flag.
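Concretely, steps 3 and 5 above might look like the following sketch (run on
the first and second NameNode hosts respectively; the usual <<<$HADOOP_PREFIX>>>
installation layout is assumed):

----
# On the first NN: upgrade its local storage dirs and the shared edit log.
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -upgrade

# On the second NN: re-sync its metadata with the upgraded NN.
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -bootstrapStandby
----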
<<To finalize an HA upgrade>>, the operator will use the <<<hdfs
dfsadmin -finalizeUpgrade>>> command while the NNs are running and one of them
is active. The active NN at the time this happens will perform the finalization
of the shared log, and the NN whose local storage directories contain the
previous FS state will delete its local state.
<<To perform a rollback>> of an upgrade, both NNs should first be shut down.
The operator should run the rollback command on the NN where they initiated
the upgrade procedure, which will perform the rollback on the local dirs there,
as well as on the shared log, either NFS or on the JNs. Afterward, this NN
should be started and the operator should run <<<-bootstrapStandby>>> on the
other NN to bring the two NNs in sync with this rolled-back file system state.
View File
@ -1,510 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
HDFS Architecture
---
${maven.build.timestamp}
HDFS Architecture
%{toc|section=1|fromDepth=0}
* Introduction
The Hadoop Distributed File System (HDFS) is a distributed file system
designed to run on commodity hardware. It has many similarities with
existing distributed file systems. However, the differences from other
distributed file systems are significant. HDFS is highly fault-tolerant
and is designed to be deployed on low-cost hardware. HDFS provides high
throughput access to application data and is suitable for applications
that have large data sets. HDFS relaxes a few POSIX requirements to
enable streaming access to file system data. HDFS was originally built
as infrastructure for the Apache Nutch web search engine project. HDFS
is part of the Apache Hadoop Core project. The project URL is
{{http://hadoop.apache.org/}}.
* Assumptions and Goals
** Hardware Failure
Hardware failure is the norm rather than the exception. An HDFS
instance may consist of hundreds or thousands of server machines, each
storing part of the file system's data. The fact that there are a huge
number of components and that each component has a non-trivial
probability of failure means that some component of HDFS is always
non-functional. Therefore, detection of faults and quick, automatic
recovery from them is a core architectural goal of HDFS.
** Streaming Data Access
Applications that run on HDFS need streaming access to their data sets.
They are not general purpose applications that typically run on general
purpose file systems. HDFS is designed more for batch processing rather
than interactive use by users. The emphasis is on high throughput of
data access rather than low latency of data access. POSIX imposes many
hard requirements that are not needed for applications that are
targeted for HDFS. POSIX semantics in a few key areas has been traded
to increase data throughput rates.
** Large Data Sets
Applications that run on HDFS have large data sets. A typical file in
HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support
large files. It should provide high aggregate data bandwidth and scale
to hundreds of nodes in a single cluster. It should support tens of
millions of files in a single instance.
** Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A
file once created, written, and closed need not be changed. This
assumption simplifies data coherency issues and enables high throughput
data access. A Map/Reduce application or a web crawler application fits
perfectly with this model. There is a plan to support appending-writes
to files in the future.
** “Moving Computation is Cheaper than Moving Data”
A computation requested by an application is much more efficient if it
is executed near the data it operates on. This is especially true when
the size of the data set is huge. This minimizes network congestion and
increases the overall throughput of the system. The assumption is that
it is often better to migrate the computation closer to where the data
is located rather than moving the data to where the application is
running. HDFS provides interfaces for applications to move themselves
closer to where the data is located.
** Portability Across Heterogeneous Hardware and Software Platforms
HDFS has been designed to be easily portable from one platform to
another. This facilitates widespread adoption of HDFS as a platform of
choice for a large set of applications.
* NameNode and DataNodes
HDFS has a master/slave architecture. An HDFS cluster consists of a
single NameNode, a master server that manages the file system namespace
and regulates access to files by clients. In addition, there are a
number of DataNodes, usually one per node in the cluster, which manage
storage attached to the nodes that they run on. HDFS exposes a file
system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks
are stored in a set of DataNodes. The NameNode executes file system
namespace operations like opening, closing, and renaming files and
directories. It also determines the mapping of blocks to DataNodes. The
DataNodes are responsible for serving read and write requests from the
file system's clients. The DataNodes also perform block creation,
deletion, and replication upon instruction from the NameNode.
[images/hdfsarchitecture.png] HDFS Architecture
The NameNode and DataNode are pieces of software designed to run on
commodity machines. These machines typically run a GNU/Linux operating
system (OS). HDFS is built using the Java language; any machine that
supports Java can run the NameNode or the DataNode software. Usage of
the highly portable Java language means that HDFS can be deployed on a
wide range of machines. A typical deployment has a dedicated machine
that runs only the NameNode software. Each of the other machines in the
cluster runs one instance of the DataNode software. The architecture
does not preclude running multiple DataNodes on the same machine but in
a real deployment that is rarely the case.
The existence of a single NameNode in a cluster greatly simplifies the
architecture of the system. The NameNode is the arbitrator and
repository for all HDFS metadata. The system is designed in such a way
that user data never flows through the NameNode.
* The File System Namespace
HDFS supports a traditional hierarchical file organization. A user or
an application can create directories and store files inside these
directories. The file system namespace hierarchy is similar to most
other existing file systems; one can create and remove files, move a
file from one directory to another, or rename a file. HDFS does not yet
implement user quotas or access permissions. HDFS does not support hard
links or soft links. However, the HDFS architecture does not preclude
implementing these features.
The NameNode maintains the file system namespace. Any change to the
file system namespace or its properties is recorded by the NameNode. An
application can specify the number of replicas of a file that should be
maintained by HDFS. The number of copies of a file is called the
replication factor of that file. This information is stored by the
NameNode.
* Data Replication
HDFS is designed to reliably store very large files across machines in
a large cluster. It stores each file as a sequence of blocks; all
blocks in a file except the last block are the same size. The blocks of
a file are replicated for fault tolerance. The block size and
replication factor are configurable per file. An application can
specify the number of replicas of a file. The replication factor can be
specified at file creation time and can be changed later. Files in HDFS
are write-once and have strictly one writer at any time.
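As an illustrative sketch only (the command itself is outside the scope of this
architecture overview, and the path is hypothetical), the replication factor of
an existing file can be changed from the HDFS shell:

----
# Set the replication factor of one file to 2 and wait (-w) for completion.
$ hdfs dfs -setrep -w 2 /user/alice/data.txt
----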
The NameNode makes all decisions regarding replication of blocks. It
periodically receives a Heartbeat and a Blockreport from each of the
DataNodes in the cluster. Receipt of a Heartbeat implies that the
DataNode is functioning properly. A Blockreport contains a list of all
blocks on a DataNode.
[images/hdfsdatanodes.png] HDFS DataNodes
** Replica Placement: The First Baby Steps
The placement of replicas is critical to HDFS reliability and
performance. Optimizing replica placement distinguishes HDFS from most
other distributed file systems. This is a feature that needs lots of
tuning and experience. The purpose of a rack-aware replica placement
policy is to improve data reliability, availability, and network
bandwidth utilization. The current implementation for the replica
placement policy is a first effort in this direction. The short-term
goals of implementing this policy are to validate it on production
systems, learn more about its behavior, and build a foundation to test
and research more sophisticated policies.
Large HDFS instances run on a cluster of computers that commonly spread
across many racks. Communication between two nodes in different racks
has to go through switches. In most cases, network bandwidth between
machines in the same rack is greater than network bandwidth between
machines in different racks.
The NameNode determines the rack id each DataNode belongs to via the
process outlined in {{{../hadoop-common/ClusterSetup.html#Hadoop+Rack+Awareness}Hadoop Rack Awareness}}. A simple but non-optimal policy
is to place replicas on unique racks. This prevents losing data when an
entire rack fails and allows use of bandwidth from multiple racks when
reading data. This policy evenly distributes replicas in the cluster
which makes it easy to balance load on component failure. However, this
policy increases the cost of writes because a write needs to transfer
blocks to multiple racks.
For the common case, when the replication factor is three, HDFS's
placement policy is to put one replica on one node in the local rack,
another on a different node in the local rack, and the last on a
different node in a different rack. This policy cuts the inter-rack
write traffic which generally improves write performance. The chance of
rack failure is far less than that of node failure; this policy does
not impact data reliability and availability guarantees. However, it
does reduce the aggregate network bandwidth used when reading data
since a block is placed in only two unique racks rather than three.
With this policy, the replicas of a file do not evenly distribute
across the racks. One third of replicas are on one node, two thirds of
replicas are on one rack, and the other third are evenly distributed
across the remaining racks. This policy improves write performance
without compromising data reliability or read performance.
The current, default replica placement policy described here is a work
in progress.
** Replica Selection
To minimize global bandwidth consumption and read latency, HDFS tries
to satisfy a read request from a replica that is closest to the reader.
If there exists a replica on the same rack as the reader node, then
that replica is preferred to satisfy the read request. If an HDFS
cluster spans multiple data centers, then a replica that is resident in
the local data center is preferred over any remote replica.
** Safemode
On startup, the NameNode enters a special state called Safemode.
Replication of data blocks does not occur when the NameNode is in the
Safemode state. The NameNode receives Heartbeat and Blockreport
messages from the DataNodes. A Blockreport contains the list of data
blocks that a DataNode is hosting. Each block has a specified minimum
number of replicas. A block is considered safely replicated when the
minimum number of replicas of that data block has checked in with the
NameNode. After a configurable percentage of safely replicated data
blocks checks in with the NameNode (plus an additional 30 seconds), the
NameNode exits the Safemode state. It then determines the list of data
blocks (if any) that still have fewer than the specified number of
replicas. The NameNode then replicates these blocks to other DataNodes.
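As a small illustrative sketch (not part of the original discussion), an
operator can check whether the NameNode is still in Safemode with the
<<<dfsadmin>>> tool:

----
$ hdfs dfsadmin -safemode get
Safe mode is OFF
----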
* The Persistence of File System Metadata
The HDFS namespace is stored by the NameNode. The NameNode uses a
transaction log called the EditLog to persistently record every change
that occurs to file system metadata. For example, creating a new file
in HDFS causes the NameNode to insert a record into the EditLog
indicating this. Similarly, changing the replication factor of a file
causes a new record to be inserted into the EditLog. The NameNode uses
a file in its local host OS file system to store the EditLog. The
entire file system namespace, including the mapping of blocks to files
and file system properties, is stored in a file called the FsImage. The
FsImage is stored as a file in the NameNode's local file system too.
The NameNode keeps an image of the entire file system namespace and
file Blockmap in memory. This key metadata item is designed to be
compact, such that a NameNode with 4 GB of RAM is plenty to support a
huge number of files and directories. When the NameNode starts up, it
reads the FsImage and EditLog from disk, applies all the transactions
from the EditLog to the in-memory representation of the FsImage, and
flushes out this new version into a new FsImage on disk. It can then
truncate the old EditLog because its transactions have been applied to
the persistent FsImage. This process is called a checkpoint. In the
current implementation, a checkpoint only occurs when the NameNode
starts up. Work is in progress to support periodic checkpointing in the
near future.
The DataNode stores HDFS data in files in its local file system. The
DataNode has no knowledge about HDFS files. It stores each block of
HDFS data in a separate file in its local file system. The DataNode
does not create all files in the same directory. Instead, it uses a
heuristic to determine the optimal number of files per directory and
creates subdirectories appropriately. It is not optimal to create all
local files in the same directory because the local file system might
not be able to efficiently support a huge number of files in a single
directory. When a DataNode starts up, it scans through its local file
system, generates a list of all HDFS data blocks that correspond to
each of these local files and sends this report to the NameNode: this
is the Blockreport.
* The Communication Protocols
All HDFS communication protocols are layered on top of the TCP/IP
protocol. A client establishes a connection to a configurable TCP port
on the NameNode machine. It talks the ClientProtocol with the NameNode.
The DataNodes talk to the NameNode using the DataNode Protocol. A
Remote Procedure Call (RPC) abstraction wraps both the Client Protocol
and the DataNode Protocol. By design, the NameNode never initiates any
RPCs. Instead, it only responds to RPC requests issued by DataNodes or
clients.
* Robustness
The primary objective of HDFS is to store data reliably even in the
presence of failures. The three common types of failures are NameNode
failures, DataNode failures and network partitions.
** Data Disk Failure, Heartbeats and Re-Replication
Each DataNode sends a Heartbeat message to the NameNode periodically. A
network partition can cause a subset of DataNodes to lose connectivity
with the NameNode. The NameNode detects this condition by the absence
of a Heartbeat message. The NameNode marks DataNodes without recent
Heartbeats as dead and does not forward any new IO requests to them.
Any data that was registered to a dead DataNode is not available to
HDFS any more. DataNode death may cause the replication factor of some
blocks to fall below their specified value. The NameNode constantly
tracks which blocks need to be replicated and initiates replication
whenever necessary. The necessity for re-replication may arise due to
many reasons: a DataNode may become unavailable, a replica may become
corrupted, a hard disk on a DataNode may fail, or the replication
factor of a file may be increased.
** Cluster Rebalancing
The HDFS architecture is compatible with data rebalancing schemes. A
scheme might automatically move data from one DataNode to another if
the free space on a DataNode falls below a certain threshold. In the
event of a sudden high demand for a particular file, a scheme might
dynamically create additional replicas and rebalance other data in the
cluster. These types of data rebalancing schemes are not yet
implemented.
** Data Integrity
It is possible that a block of data fetched from a DataNode arrives
corrupted. This corruption can occur because of faults in a storage
device, network faults, or buggy software. The HDFS client software
implements checksum checking on the contents of HDFS files. When a
client creates an HDFS file, it computes a checksum of each block of
the file and stores these checksums in a separate hidden file in the
same HDFS namespace. When a client retrieves file contents it verifies
that the data it received from each DataNode matches the checksum
stored in the associated checksum file. If not, then the client can opt
to retrieve that block from another DataNode that has a replica of that
block.
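Checksum verification happens transparently whenever a file is read through the <<<FileSystem>>> API. The hedged sketch below, assuming a reachable HDFS at the default <<<fs.defaultFS>>> and an existing file at the hypothetical path <<</foodir/myfile.txt>>>, reads a file normally and also fetches its file-level checksum with <<<getFileChecksum>>>.

----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/foodir/myfile.txt");

    // Block data is verified against the hidden checksum data as bytes are read;
    // a corrupt replica causes the client to fall back to another DataNode.
    try (FSDataInputStream in = fs.open(file)) {
      byte[] buffer = new byte[4096];
      while (in.read(buffer) != -1) {
        // consume the data; verification is implicit
      }
    }

    // A whole-file checksum can also be requested explicitly.
    FileChecksum checksum = fs.getFileChecksum(file);
    System.out.println(file + " checksum: " + checksum);
  }
}
----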
** Metadata Disk Failure
The FsImage and the EditLog are central data structures of HDFS. A
corruption of these files can cause the HDFS instance to be
non-functional. For this reason, the NameNode can be configured to
support maintaining multiple copies of the FsImage and EditLog. Any
update to either the FsImage or EditLog causes each of the FsImages and
EditLogs to get updated synchronously. This synchronous updating of
multiple copies of the FsImage and EditLog may degrade the rate of
namespace transactions per second that a NameNode can support. However,
this degradation is acceptable because even though HDFS applications
are very data intensive in nature, they are not metadata intensive.
When a NameNode restarts, it selects the latest consistent FsImage and
EditLog to use.
The NameNode machine is a single point of failure for an HDFS cluster.
If the NameNode machine fails, manual intervention is necessary.
Currently, automatic restart and failover of the NameNode software to
another machine is not supported.
** Snapshots
Snapshots support storing a copy of data at a particular instant of
time. One usage of the snapshot feature may be to roll back a corrupted
HDFS instance to a previously known good point in time. HDFS does not
currently support snapshots but will in a future release.
* Data Organization
** Data Blocks
HDFS is designed to support very large files. Applications that are
compatible with HDFS are those that deal with large data sets. These
applications write their data only once but they read it one or more
times and require these reads to be satisfied at streaming speeds. HDFS
supports write-once-read-many semantics on files. A typical block size
used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB
chunks, and if possible, each chunk will reside on a different
DataNode.
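To see how a file is split into blocks and where the replicas live, a client can ask the NameNode for the block locations. The sketch below is a minimal illustration, assuming an existing large file at the hypothetical path <<</foodir/large.log>>>; <<<getFileBlockLocations>>> and <<<getDefaultBlockSize>>> are part of the standard <<<FileSystem>>> API.

----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLayout {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/foodir/large.log");

    System.out.println("Default block size: " + fs.getDefaultBlockSize(file) + " bytes");

    // Each BlockLocation describes one block of the file and the DataNodes hosting its replicas.
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + String.join(",", block.getHosts()));
    }
  }
}
----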
** Staging
A client request to create a file does not reach the NameNode
immediately. In fact, initially the HDFS client caches the file data
into a temporary local file. Application writes are transparently
redirected to this temporary local file. When the local file
accumulates data worth over one HDFS block size, the client contacts
the NameNode. The NameNode inserts the file name into the file system
hierarchy and allocates a data block for it. The NameNode responds to
the client request with the identity of the DataNode and the
destination data block. Then the client flushes the block of data from
the local temporary file to the specified DataNode. When a file is
closed, the remaining un-flushed data in the temporary local file is
transferred to the DataNode. The client then tells the NameNode that
the file is closed. At this point, the NameNode commits the file
creation operation into a persistent store. If the NameNode dies before
the file is closed, the file is lost.
The above approach has been adopted after careful consideration of
target applications that run on HDFS. These applications need streaming
writes to files. If a client writes to a remote file directly without
any client-side buffering, the network speed and the congestion in the
network impact throughput considerably. This approach is not without
precedent. Earlier distributed file systems, e.g. AFS, have used client
side caching to improve performance. A POSIX requirement has been
relaxed to achieve higher performance of data uploads.
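From the client API the staging and commit behavior is invisible: an application simply opens an output stream, writes, and closes it, and the file becomes durable in the namespace at close time. A minimal sketch, assuming write access to the hypothetical path <<</foodir/newfile.txt>>>:

----
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnce {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/foodir/newfile.txt");

    // Data written here may be buffered on the client side; the NameNode commits the
    // file creation only once close() succeeds.
    try (FSDataOutputStream out = fs.create(file, /* overwrite = */ true)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }
  }
}
----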
** Replication Pipelining
When a client is writing data to an HDFS file, its data is first
written to a local file as explained in the previous section. Suppose
the HDFS file has a replication factor of three. When the local file
accumulates a full block of user data, the client retrieves a list of
DataNodes from the NameNode. This list contains the DataNodes that will
host a replica of that block. The client then flushes the data block to
the first DataNode. The first DataNode starts receiving the data in
small portions, writes each portion to its local repository and
transfers that portion to the second DataNode in the list. The second
DataNode, in turn starts receiving each portion of the data block,
writes that portion to its repository and then flushes that portion to
the third DataNode. Finally, the third DataNode writes the data to its
local repository. Thus, a DataNode can be receiving data from the
previous one in the pipeline and at the same time forwarding data to
the next one in the pipeline. Thus, the data is pipelined from one
DataNode to the next.
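The replication factor that drives the pipeline can be chosen per file at create time. The following hedged snippet (the path and sizes are illustrative) uses the <<<FileSystem.create>>> overload that takes an explicit replication factor and block size, so the NameNode will hand the client a pipeline of three DataNodes for each block:

----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PipelinedWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    short replication = 3;                 // three replicas -> three-DataNode pipeline
    long blockSize = 64L * 1024 * 1024;    // 64 MB blocks, matching the example above
    int bufferSize = 4096;

    try (FSDataOutputStream out = fs.create(new Path("/foodir/pipelined.dat"),
        true, bufferSize, replication, blockSize)) {
      out.write(new byte[1024 * 1024]);    // data flows DataNode to DataNode as described
    }
  }
}
----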
* Accessibility
HDFS can be accessed from applications in many different ways.
Natively, HDFS provides a
{{{http://hadoop.apache.org/docs/current/api/}FileSystem Java API}}
for applications to use. A C language wrapper for this Java API is also
available. In addition, an HTTP browser can also be used to browse the files
of an HDFS instance. Work is in progress to expose HDFS through the WebDAV
protocol.
** FS Shell
HDFS allows user data to be organized in the form of files and
directories. It provides a commandline interface called FS shell that
lets a user interact with the data in HDFS. The syntax of this command
set is similar to other shells (e.g. bash, csh) that users are already
familiar with. Here are some sample action/command pairs:
*---------+---------+
|| Action | Command
*---------+---------+
| Create a directory named <<</foodir>>> | <<<bin/hadoop dfs -mkdir /foodir>>>
*---------+---------+
| Remove a directory named <<</foodir>>> | <<<bin/hadoop fs -rm -R /foodir>>>
*---------+---------+
| View the contents of a file named <<</foodir/myfile.txt>>> | <<<bin/hadoop dfs -cat /foodir/myfile.txt>>>
*---------+---------+
FS shell is targeted for applications that need a scripting language to
interact with the stored data.
** DFSAdmin
The DFSAdmin command set is used for administering an HDFS cluster.
These are commands that are used only by an HDFS administrator. Here
are some sample action/command pairs:
*---------+---------+
|| Action | Command
*---------+---------+
|Put the cluster in Safemode | <<<bin/hdfs dfsadmin -safemode enter>>>
*---------+---------+
|Generate a list of DataNodes | <<<bin/hdfs dfsadmin -report>>>
*---------+---------+
|Recommission or decommission DataNode(s) | <<<bin/hdfs dfsadmin -refreshNodes>>>
*---------+---------+
** Browser Interface
A typical HDFS install configures a web server to expose the HDFS
namespace through a configurable TCP port. This allows a user to
navigate the HDFS namespace and view the contents of its files using a
web browser.
* Space Reclamation
** File Deletes and Undeletes
When a file is deleted by a user or an application, it is not
immediately removed from HDFS. Instead, HDFS first renames it to a file
in the <<</trash>>> directory. The file can be restored quickly as long as it
remains in <<</trash>>>. A file remains in <<</trash>>> for a configurable amount
of time. After the expiry of its life in <<</trash>>>, the NameNode deletes
the file from the HDFS namespace. The deletion of a file causes the
blocks associated with the file to be freed. Note that there could be
an appreciable time delay between the time a file is deleted by a user
and the time of the corresponding increase in free space in HDFS.
A user can Undelete a file after deleting it as long as it remains in
the <<</trash>>> directory. If a user wants to undelete a file that he/she
has deleted, he/she can navigate the <<</trash>>> directory and retrieve the
file. The <<</trash>>> directory contains only the latest copy of the file
that was deleted. The <<</trash>>> directory is just like any other directory
with one special feature: HDFS applies specified policies to
automatically delete files from this directory. The current default trash
interval is 0, meaning files are deleted immediately without being stored in
trash. This value is a configurable parameter, <<<fs.trash.interval>>>, set in
core-site.xml.
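Programmatic clients that want shell-like trash behavior can use the <<<org.apache.hadoop.fs.Trash>>> helper rather than deleting files directly. A hedged sketch, assuming <<<fs.trash.interval>>> is set to a non-zero value (here via the client Configuration purely for illustration) and a hypothetical file <<</foodir/old.log>>>:

----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.trash.interval", "1440");   // keep trashed files for 24 hours (minutes)
    FileSystem fs = FileSystem.get(conf);

    // Moves the file into the current user's trash directory instead of deleting it outright.
    Trash trash = new Trash(fs, conf);
    boolean moved = trash.moveToTrash(new Path("/foodir/old.log"));
    System.out.println("Moved to trash: " + moved);
  }
}
----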
** Decrease Replication Factor
When the replication factor of a file is reduced, the NameNode selects
excess replicas that can be deleted. The next Heartbeat transfers this
information to the DataNode. The DataNode then removes the
corresponding blocks and the corresponding free space appears in the
cluster. Once again, there might be a time delay between the completion
of the setReplication API call and the appearance of free space in the
cluster.
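The replication factor of an existing file can be changed through the <<<FileSystem.setReplication>>> call (or the <<<hdfs dfs -setrep>>> shell command); lowering it triggers the replica-removal flow described above. A minimal sketch against a hypothetical file:

----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowerReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Ask the NameNode to keep only 2 replicas; excess replicas are removed lazily
    // via DataNode heartbeats, so free space appears with some delay.
    boolean accepted = fs.setReplication(new Path("/foodir/old.log"), (short) 2);
    System.out.println("Replication change accepted: " + accepted);
  }
}
----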
* References
Hadoop {{{http://hadoop.apache.org/docs/current/api/}JavaDoc API}}.
HDFS source code: {{http://hadoop.apache.org/version_control.html}}
@ -1,104 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Offline Edits Viewer Guide
---
Erik Steffl
---
${maven.build.timestamp}
Offline Edits Viewer Guide
%{toc|section=1|fromDepth=0}
* Overview
Offline Edits Viewer is a tool to parse the Edits log file. The current
processors are mostly useful for conversion between different formats,
including XML, which is human readable and easier to edit than the native
binary format.
The tool can parse the edits formats -18 (roughly Hadoop 0.19) and
later. The tool operates on files only; it does not need a Hadoop cluster
to be running.
Input formats supported:
[[1]] <<binary>>: native binary format that Hadoop uses internally
[[2]] <<xml>>: XML format, as produced by xml processor, used if filename
has <<<.xml>>> (case insensitive) extension
The Offline Edits Viewer provides several output processors (unless
stated otherwise the output of the processor can be converted back to
original edits file):
[[1]] <<binary>>: native binary format that Hadoop uses internally
[[2]] <<xml>>: XML format
[[3]] <<stats>>: prints out statistics, this cannot be converted back to
Edits file
* Usage
----
bash$ bin/hdfs oev -i edits -o edits.xml
----
*-----------------------:-----------------------------------+
| Flag | Description |
*-----------------------:-----------------------------------+
|[<<<-i>>> ; <<<--inputFile>>>] <input file> | Specify the input edits log file to
| | process. Xml (case insensitive) extension means XML format otherwise
| | binary format is assumed. Required.
*-----------------------:-----------------------------------+
|[<<-o>> ; <<--outputFile>>] <output file> | Specify the output filename, if the
| | specified output processor generates one. If the specified file already
| | exists, it is silently overwritten. Required.
*-----------------------:-----------------------------------+
|[<<-p>> ; <<--processor>>] <processor> | Specify the image processor to apply
| | against the image file. Currently valid options are
| | <<<binary>>>, <<<xml>>> (default) and <<<stats>>>.
*-----------------------:-----------------------------------+
|<<[-v ; --verbose] >> | Print the input and output filenames and pipe output of
| | processor to console as well as specified file. On extremely large
| | files, this may increase processing time by an order of magnitude.
*-----------------------:-----------------------------------+
|<<[-h ; --help] >> | Display the tool usage and help information and exit.
*-----------------------:-----------------------------------+
* Case study: Hadoop cluster recovery
If there is a problem with the Hadoop cluster and the edits file is
corrupted, it is possible to save at least the part of the edits file that
is correct. This can be done by converting the binary edits to XML, editing
it manually, and then converting it back to binary. The most common
problem is that the edits file is missing the closing record (the record
that has opCode -1). This should be recognized by the tool, and the XML
format should be properly closed.
If there is no closing record in the XML file, you can add one after the
last correct record. Anything after the record with opCode -1 is
ignored.
Example of a closing record (with opCode -1):
+----
<RECORD>
<OPCODE>-1</OPCODE>
<DATA>
</DATA>
</RECORD>
+----
@ -1,247 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Offline Image Viewer Guide
---
---
${maven.build.timestamp}
Offline Image Viewer Guide
%{toc|section=1|fromDepth=0}
* Overview
The Offline Image Viewer is a tool to dump the contents of hdfs fsimage
files to a human-readable format and provide read-only WebHDFS API
in order to allow offline analysis and examination of a Hadoop cluster's
namespace. The tool is able to process very large image files relatively
quickly. The tool handles the layout formats that were included with Hadoop
versions 2.4 and up. If you want to handle older layout formats, you can
use the Offline Image Viewer of Hadoop 2.3 or {{oiv_legacy Command}}.
If the tool is not able to process an image file, it will exit cleanly.
The Offline Image Viewer does not require a Hadoop cluster to be running;
it is entirely offline in its operation.
The Offline Image Viewer provides several output processors:
[[1]] Web is the default output processor. It launches an HTTP server
that exposes read-only WebHDFS API. Users can investigate the namespace
interactively by using HTTP REST API.
[[2]] XML creates an XML document of the fsimage and includes all of the
information within the fsimage, similar to the lsr processor. The
output of this processor is amenable to automated processing and
analysis with XML tools. Due to the verbosity of the XML syntax,
this processor will also generate the largest amount of output.
[[3]] FileDistribution is the tool for analyzing file sizes in the
namespace image. In order to run the tool one should define a range
of integers [0, maxSize] by specifying maxSize and a step. The
range of integers is divided into segments of size step: [0, s[1],
..., s[n-1], maxSize], and the processor calculates how many files
in the system fall into each segment [s[i-1], s[i]). Note that
files larger than maxSize always fall into the very last segment.
The output file is formatted as a tab-separated two-column table:
Size and NumFiles, where Size represents the start of the segment
and NumFiles is the number of files from the image whose size falls
in this segment.
* Usage
** Web Processor
The Web processor launches an HTTP server that exposes a read-only WebHDFS API.
Users can specify the address to listen on with the -addr option
(localhost:5978 by default).
----
bash$ bin/hdfs oiv -i fsimage
14/04/07 13:25:14 INFO offlineImageViewer.WebImageViewer: WebImageViewer
started. Listening on /127.0.0.1:5978. Press Ctrl+C to stop the viewer.
----
Users can access the viewer and get the information of the fsimage by
the following shell command:
----
bash$ bin/hdfs dfs -ls webhdfs://127.0.0.1:5978/
Found 2 items
drwxrwx--- - root supergroup 0 2014-03-26 20:16 webhdfs://127.0.0.1:5978/tmp
drwxr-xr-x - root supergroup 0 2014-03-31 14:08 webhdfs://127.0.0.1:5978/user
----
To get the information of all the files and directories, you can simply use
the following command:
----
bash$ bin/hdfs dfs -ls -R webhdfs://127.0.0.1:5978/
----
Users can also get JSON formatted FileStatuses via HTTP REST API.
----
bash$ curl -i http://127.0.0.1:5978/webhdfs/v1/?op=liststatus
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 252
{"FileStatuses":{"FileStatus":[
{"fileId":16386,"accessTime":0,"replication":0,"owner":"theuser","length":0,"permission":"755","blockSize":0,"modificationTime":1392772497282,"type":"DIRECTORY","group":"supergroup","childrenNum":1,"pathSuffix":"user"}
]}}
----
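Because the viewer speaks the WebHDFS protocol, an ordinary Hadoop client can also point the <<<FileSystem>>> API at it. The sketch below is a hedged illustration assuming the viewer is running on <<<127.0.0.1:5978>>> as in the examples above:

----
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BrowseImage {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The webhdfs:// scheme maps to the WebHDFS client implementation.
    FileSystem fs = FileSystem.get(URI.create("webhdfs://127.0.0.1:5978/"), conf);

    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.println(status.getPath() + "\t" + status.getOwner()
          + "\t" + status.getPermission());
    }
  }
}
----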
The Web processor now supports the following operations:
* {{{./WebHDFS.html#List_a_Directory}LISTSTATUS}}
* {{{./WebHDFS.html#Status_of_a_FileDirectory}GETFILESTATUS}}
* {{{./WebHDFS.html#Get_ACL_Status}GETACLSTATUS}}
** XML Processor
The XML processor is used to dump all the contents of the fsimage. Users can
specify the input and output files via the -i and -o command-line options.
----
bash$ bin/hdfs oiv -p XML -i fsimage -o fsimage.xml
----
This will create a file named fsimage.xml containing all the information in
the fsimage. For very large image files, this process may take several
minutes.
Applying the Offline Image Viewer with the XML processor would result in the
following output:
----
<?xml version="1.0"?>
<fsimage>
<NameSection>
<genstampV1>1000</genstampV1>
<genstampV2>1002</genstampV2>
<genstampV1Limit>0</genstampV1Limit>
<lastAllocatedBlockId>1073741826</lastAllocatedBlockId>
<txid>37</txid>
</NameSection>
<INodeSection>
<lastInodeId>16400</lastInodeId>
<inode>
<id>16385</id>
<type>DIRECTORY</type>
<name></name>
<mtime>1392772497282</mtime>
<permission>theuser:supergroup:rwxr-xr-x</permission>
<nsquota>9223372036854775807</nsquota>
<dsquota>-1</dsquota>
</inode>
...remaining output omitted...
----
* Options
*-----------------------:-----------------------------------+
| <<Flag>> | <<Description>> |
*-----------------------:-----------------------------------+
| <<<-i>>>\|<<<--inputFile>>> <input file> | Specify the input fsimage file
| | to process. Required.
*-----------------------:-----------------------------------+
| <<<-o>>>\|<<<--outputFile>>> <output file> | Specify the output filename,
| | if the specified output processor generates one. If
| | the specified file already exists, it is silently
| | overwritten. (output to stdout by default)
*-----------------------:-----------------------------------+
| <<<-p>>>\|<<<--processor>>> <processor> | Specify the image processor to
| | apply against the image file. Currently valid options
| | are Web (default), XML and FileDistribution.
*-----------------------:-----------------------------------+
| <<<-addr>>> <address> | Specify the address (host:port) to listen on.
| | (localhost:5978 by default). This option is used with
| | Web processor.
*-----------------------:-----------------------------------+
| <<<-maxSize>>> <size> | Specify the range [0, maxSize] of file sizes to be
| | analyzed in bytes (128GB by default). This option is
| | used with FileDistribution processor.
*-----------------------:-----------------------------------+
| <<<-step>>> <size> | Specify the granularity of the distribution in bytes
| | (2MB by default). This option is used with
| | FileDistribution processor.
*-----------------------:-----------------------------------+
| <<<-h>>>\|<<<--help>>>| Display the tool usage and help information and
| | exit.
*-----------------------:-----------------------------------+
* Analyzing Results
The Offline Image Viewer makes it easy to gather large amounts of data
about the hdfs namespace. This information can then be used to explore
file system usage patterns or find specific files that match arbitrary
criteria, along with other types of namespace analysis.
* oiv_legacy Command
Due to the internal layout changes introduced by the ProtocolBuffer-based
fsimage ({{{https://issues.apache.org/jira/browse/HDFS-5698}HDFS-5698}}),
OfflineImageViewer consumes an excessive amount of memory and loses some
functions such as the Indented and Delimited processors. If you want to process
without a large amount of memory or use these processors, you can use the
<<<oiv_legacy>>> command (same as <<<oiv>>> in Hadoop 2.3).
** Usage
1. Set <<<dfs.namenode.legacy-oiv-image.dir>>> to an appropriate directory
to make the standby NameNode or SecondaryNameNode save its namespace in the
old fsimage format during checkpointing.
2. Use the <<<oiv_legacy>>> command on the old format fsimage.
----
bash$ bin/hdfs oiv_legacy -i fsimage_old -o output
----
** Options
*-----------------------:-----------------------------------+
| <<Flag>> | <<Description>> |
*-----------------------:-----------------------------------+
| <<<-i>>>\|<<<--inputFile>>> <input file> | Specify the input fsimage file to
| | process. Required.
*-----------------------:-----------------------------------+
| <<<-o>>>\|<<<--outputFile>>> <output file> | Specify the output filename, if
| | the specified output processor generates one. If the
| | specified file already exists, it is silently
| | overwritten. Required.
*-----------------------:-----------------------------------+
| <<<-p>>>\|<<<--processor>>> <processor> | Specify the image processor to
| | apply against the image file. Valid options are
| | Ls (default), XML, Delimited, Indented, and
| | FileDistribution.
*-----------------------:-----------------------------------+
| <<<-skipBlocks>>> | Do not enumerate individual blocks within files. This
| | may save processing time and output file space on
| | namespaces with very large files. The Ls processor
| | reads the blocks to correctly determine file sizes
| | and ignores this option.
*-----------------------:-----------------------------------+
| <<<-printToScreen>>> | Pipe output of processor to console as well as
| | specified file. On extremely large namespaces, this
| | may increase processing time by an order of
| | magnitude.
*-----------------------:-----------------------------------+
| <<<-delimiter>>> <arg>| When used in conjunction with the Delimited
| | processor, replaces the default tab delimiter with
| | the string specified by <arg>.
*-----------------------:-----------------------------------+
| <<<-h>>>\|<<<--help>>>| Display the tool usage and help information and exit.
*-----------------------:-----------------------------------+
@ -1,145 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Distributed File System-${project.version} - Support for Multi-Homed Networks
---
---
${maven.build.timestamp}
HDFS Support for Multihomed Networks
This document is targeted at cluster administrators deploying <<<HDFS>>> in
multihomed networks. Similar support for <<<YARN>>>/<<<MapReduce>>> is
work in progress and will be documented when available.
%{toc|section=1|fromDepth=0}
* Multihoming Background
In multihomed networks the cluster nodes are connected to more than one
network interface. There could be multiple reasons for doing so.
[[1]] <<Security>>: Security requirements may dictate that intra-cluster
traffic be confined to a different network than the network used to
transfer data in and out of the cluster.
[[2]] <<Performance>>: Intra-cluster traffic may use one or more high bandwidth
interconnects like Fiber Channel, Infiniband or 10GbE.
[[3]] <<Failover/Redundancy>>: The nodes may have multiple network adapters
connected to a single network to handle network adapter failure.
Note that NIC Bonding (also known as NIC Teaming or Link
Aggregation) is a related but separate topic. The following settings
are usually not applicable to a NIC bonding configuration which handles
multiplexing and failover transparently while presenting a single 'logical
network' to applications.
* Fixing Hadoop Issues In Multihomed Environments
** Ensuring HDFS Daemons Bind All Interfaces
By default <<<HDFS>>> endpoints are specified as either hostnames or IP addresses.
In either case, <<<HDFS>>> daemons will bind to a single IP address, making
the daemons unreachable from other networks.
The solution is to have a separate setting for server endpoints to force binding
to the wildcard IP address <<<INADDR_ANY>>>, i.e. <<<0.0.0.0>>>. Do NOT supply a port
number with any of these settings.
----
<property>
<name>dfs.namenode.rpc-bind-host</name>
<value>0.0.0.0</value>
<description>
The actual address the RPC server will bind to. If this optional address is
set, it overrides only the hostname portion of dfs.namenode.rpc-address.
It can also be specified per name node or name service for HA/Federation.
This is useful for making the name node listen on all interfaces by
setting it to 0.0.0.0.
</description>
</property>
<property>
<name>dfs.namenode.servicerpc-bind-host</name>
<value>0.0.0.0</value>
<description>
The actual address the service RPC server will bind to. If this optional address is
set, it overrides only the hostname portion of dfs.namenode.servicerpc-address.
It can also be specified per name node or name service for HA/Federation.
This is useful for making the name node listen on all interfaces by
setting it to 0.0.0.0.
</description>
</property>
<property>
<name>dfs.namenode.http-bind-host</name>
<value>0.0.0.0</value>
<description>
The actual address the HTTP server will bind to. If this optional address
is set, it overrides only the hostname portion of dfs.namenode.http-address.
It can also be specified per name node or name service for HA/Federation.
This is useful for making the name node HTTP server listen on all
interfaces by setting it to 0.0.0.0.
</description>
</property>
<property>
<name>dfs.namenode.https-bind-host</name>
<value>0.0.0.0</value>
<description>
The actual address the HTTPS server will bind to. If this optional address
is set, it overrides only the hostname portion of dfs.namenode.https-address.
It can also be specified per name node or name service for HA/Federation.
This is useful for making the name node HTTPS server listen on all
interfaces by setting it to 0.0.0.0.
</description>
</property>
----
** Clients use Hostnames when connecting to DataNodes
By default <<<HDFS>>> clients connect to DataNodes using the IP address
provided by the NameNode. Depending on the network configuration this
IP address may be unreachable by the clients. The fix is to let clients perform
their own DNS resolution of the DataNode hostname. The following setting
enables this behavior.
----
<property>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
<description>Whether clients should use datanode hostnames when
connecting to datanodes.
</description>
</property>
----
** DataNodes use HostNames when connecting to other DataNodes
Rarely, the NameNode-resolved IP address for a DataNode may be unreachable
from other DataNodes. The fix is to force DataNodes to perform their own
DNS resolution for inter-DataNode connections. The following setting enables
this behavior.
----
<property>
<name>dfs.datanode.use.datanode.hostname</name>
<value>true</value>
<description>Whether datanodes should use datanode hostnames when
connecting to other datanodes for data transfer.
</description>
</property>
----
@ -1,364 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Distributed File System-${project.version} - HDFS NFS Gateway
---
---
${maven.build.timestamp}
HDFS NFS Gateway
%{toc|section=1|fromDepth=0}
* {Overview}
The NFS Gateway supports NFSv3 and allows HDFS to be mounted as part of the client's local file system.
Currently NFS Gateway supports and enables the following usage patterns:
* Users can browse the HDFS file system through their local file system
on NFSv3 client compatible operating systems.
* Users can download files from the HDFS file system onto their
local file system.
* Users can upload files from their local file system directly to the
HDFS file system.
* Users can stream data directly to HDFS through the mount point. File
append is supported but random write is not supported.
The NFS gateway machine needs the same things required to run an HDFS client, such as the Hadoop JAR files and the HADOOP_CONF directory.
The NFS gateway can be on the same host as a DataNode, NameNode, or any HDFS client.
* {Configuration}
The NFS gateway uses a proxy user to proxy all the users accessing the NFS mounts.
In non-secure mode, the user running the gateway is the proxy user, while in secure mode the
user in the Kerberos keytab is the proxy user. Suppose the proxy user is 'nfsserver'
and users belonging to the groups 'users-group1'
and 'users-group2' use the NFS mounts, then in core-site.xml of the NameNode, the following
two properties must be set, and only the NameNode needs to be restarted after the configuration change
(NOTE: replace the string 'nfsserver' with the proxy user name in your cluster):
----
<property>
<name>hadoop.proxyuser.nfsserver.groups</name>
<value>root,users-group1,users-group2</value>
<description>
The 'nfsserver' user is allowed to proxy all members of the 'users-group1' and
'users-group2' groups. Note that in most cases you will need to include the
group "root" because the user "root" (which usually belonges to "root" group) will
generally be the user that initially executes the mount on the NFS client system.
Set this to '*' to allow nfsserver user to proxy any group.
</description>
</property>
----
----
<property>
<name>hadoop.proxyuser.nfsserver.hosts</name>
<value>nfs-client-host1.com</value>
<description>
This is the host where the nfs gateway is running. Set this to '*' to allow
requests from any hosts to be proxied.
</description>
</property>
----
The above are the only required configurations for the NFS gateway in non-secure mode. For Kerberized
Hadoop clusters, the following configurations need to be added to hdfs-site.xml for the gateway (NOTE: replace the
string "nfsserver" with the proxy user name and ensure the user contained in the keytab is
also the same proxy user):
----
<property>
<name>nfs.keytab.file</name>
<value>/etc/hadoop/conf/nfsserver.keytab</value> <!-- path to the nfs gateway keytab -->
</property>
----
----
<property>
<name>nfs.kerberos.principal</name>
<value>nfsserver/_HOST@YOUR-REALM.COM</value>
</property>
----
The rest of the NFS gateway configurations are optional for both secure and non-secure mode.
The AIX NFS client has a {{{https://issues.apache.org/jira/browse/HDFS-6549}few known issues}}
that prevent it from working correctly by default with the HDFS NFS
Gateway. If you want to be able to access the HDFS NFS Gateway from AIX, you
should set the following configuration setting to enable work-arounds for these
issues:
----
<property>
<name>nfs.aix.compatibility.mode.enabled</name>
<value>true</value>
</property>
----
Note that regular, non-AIX clients should NOT enable AIX compatibility mode.
The work-arounds implemented by AIX compatibility mode effectively disable
safeguards to ensure that listing of directory contents via NFS returns
consistent results, and that all data sent to the NFS server can be assured to
have been committed.
It is strongly recommended that users update a few configuration properties based on their use
cases. All the following configuration properties can be added or updated in hdfs-site.xml.
* If the client mounts the export with access time update allowed, make sure the following
property is not disabled in the configuration file. Only NameNode needs to restart after
this property is changed. On some Unix systems, the user can disable access time update
by mounting the export with "noatime". If the export is mounted with "noatime", the user
does not need to change the following property, and thus there is no need to restart the NameNode.
----
<property>
<name>dfs.namenode.accesstime.precision</name>
<value>3600000</value>
<description>The access time for HDFS file is precise upto this value.
The default value is 1 hour. Setting a value of 0 disables
access times for HDFS.
</description>
</property>
----
* Users are expected to update the file dump directory. The NFS client often
reorders writes. Sequential writes can arrive at the NFS gateway in random
order. This directory is used to temporarily save out-of-order writes
before writing to HDFS. For each file, the out-of-order writes are dumped after
they accumulate to exceed a certain threshold (e.g., 1MB) in memory.
One needs to make sure the directory has enough
space. For example, if the application uploads 10 files of 100MB each, it is
recommended for this directory to have roughly 1GB of space in case a
worst-case write reorder happens to every file. Only the NFS gateway needs to be
restarted after this property is updated.
----
<property>
<name>nfs.dump.dir</name>
<value>/tmp/.hdfs-nfs</value>
</property>
----
* By default, the export can be mounted by any client. To better control the access,
users can update the following property. The value string contains machine name and
access privilege, separated by whitespace
characters. The machine name format can be a single host, a Java regular expression, or an IPv4 address. The
access privilege uses rw or ro to specify read/write or read-only access of the machines to exports. If the access
privilege is not provided, the default is read-only. Entries are separated by ";".
For example: "192.168.0.0/22 rw ; host.*\.example\.com ; host1.test.org ro;". Only the NFS gateway needs to restart after
this property is updated.
----
<property>
<name>nfs.exports.allowed.hosts</name>
<value>* rw</value>
</property>
----
* JVM and log settings. You can export JVM settings (e.g., heap size and GC log) in
HADOOP_NFS3_OPTS. More NFS-related settings can be found in hadoop-env.sh.
To get an NFS debug trace, you can edit the log4j.properties file
to add the following. Note that debug trace, especially for ONCRPC, can be very verbose.
To change logging level:
-----------------------------------------------
log4j.logger.org.apache.hadoop.hdfs.nfs=DEBUG
-----------------------------------------------
To get more details of ONCRPC requests:
-----------------------------------------------
log4j.logger.org.apache.hadoop.oncrpc=DEBUG
-----------------------------------------------
* {Start and stop NFS gateway service}
Three daemons are required to provide NFS service: rpcbind (or portmap), mountd and nfsd.
The NFS gateway process has both nfsd and mountd. It shares the HDFS root "/" as the
only export. It is recommended to use the portmap included in the NFS gateway package. Even
though the NFS gateway works with the portmap/rpcbind provided by most Linux distributions, the
portmap included in the package is needed on some Linux systems such as RHEL 6.2 due to an
{{{https://bugzilla.redhat.com/show_bug.cgi?id=731542}rpcbind bug}}. More detailed discussions can
be found in {{{https://issues.apache.org/jira/browse/HDFS-4763}HDFS-4763}}.
[[1]] Stop nfsv3 and rpcbind/portmap services provided by the platform (commands can be different on various Unix platforms):
-------------------------
[root]> service nfs stop
[root]> service rpcbind stop
-------------------------
[[2]] Start Hadoop's portmap (needs root privileges):
-------------------------
[root]> $HADOOP_PREFIX/bin/hdfs --daemon start portmap
-------------------------
[[3]] Start mountd and nfsd.
No root privileges are required for this command. In non-secure mode, the NFS gateway
should be started by the proxy user mentioned at the beginning of this user guide.
While in secure mode, any user can start NFS gateway
as long as the user has read access to the Kerberos keytab defined in "nfs.keytab.file".
-------------------------
[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon start nfs3
-------------------------
[[4]] Stop NFS gateway services.
-------------------------
[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon stop nfs3
[root]> $HADOOP_PREFIX/bin/hdfs --daemon stop portmap
-------------------------
Optionally, you can forgo running the Hadoop-provided portmap daemon and
instead use the system portmap daemon on all operating systems if you start the
NFS Gateway as root. This will allow the HDFS NFS Gateway to work around the
aforementioned bug and still register using the system portmap daemon. To do
so, just start the NFS gateway daemon as you normally would, but make sure to
do so as the "root" user, and also set the "HADOOP_PRIVILEGED_NFS_USER"
environment variable to an unprivileged user. In this mode the NFS Gateway will
start as root to perform its initial registration with the system portmap, and
then will drop privileges back to the user specified by
HADOOP_PRIVILEGED_NFS_USER for the remainder of the NFS Gateway process's
lifetime. Note that if you choose this route, you
should skip steps 1 and 2 above.
* {Verify validity of NFS related services}
[[1]] Execute the following command to verify if all the services are up and running:
-------------------------
[root]> rpcinfo -p $nfs_server_ip
-------------------------
You should see output similar to the following:
-------------------------
program vers proto port
100005 1 tcp 4242 mountd
100005 2 udp 4242 mountd
100005 2 tcp 4242 mountd
100000 2 tcp 111 portmapper
100000 2 udp 111 portmapper
100005 3 udp 4242 mountd
100005 1 udp 4242 mountd
100003 3 tcp 2049 nfs
100005 3 tcp 4242 mountd
-------------------------
[[2]] Verify if the HDFS namespace is exported and can be mounted.
-------------------------
[root]> showmount -e $nfs_server_ip
-------------------------
You should see output similar to the following:
-------------------------
Exports list on $nfs_server_ip :
/ (everyone)
-------------------------
* {Mount the export “/”}
Currently NFS v3 only uses TCP as the transport protocol.
NLM is not supported, so the mount option "nolock" is needed. It is recommended to use a
hard mount. This is because, even after the client sends all data to the
NFS gateway, it may take the NFS gateway some extra time to transfer data to HDFS
when writes are reordered by the NFS client kernel.
If a soft mount has to be used, the user should give it a relatively
long timeout (at least no less than the default timeout on the host).
The users can mount the HDFS namespace as shown below:
-------------------------------------------------------------------
[root]>mount -t nfs -o vers=3,proto=tcp,nolock,noacl $server:/ $mount_point
-------------------------------------------------------------------
Then the users can access HDFS as part of the local file system, except that
hard links and random writes are not supported yet. To optimize the performance
of large file I/O, one can increase the NFS transfer size (rsize and wsize) during mount.
By default, the NFS gateway supports 1MB as the maximum transfer size. For a larger data
transfer size, one needs to update "nfs.rtmax" and "nfs.wtmax" in hdfs-site.xml.
* {Allow mounts from unprivileged clients}
In environments where root access on client machines is not generally
available, some measure of security can be obtained by ensuring that only NFS
clients originating from privileged ports can connect to the NFS server. This
feature is referred to as "port monitoring." This feature is not enabled by default
in the HDFS NFS Gateway, but can be optionally enabled by setting the
following config in hdfs-site.xml on the NFS Gateway machine:
-------------------------------------------------------------------
<property>
<name>nfs.port.monitoring.disabled</name>
<value>false</value>
</property>
-------------------------------------------------------------------
* {User authentication and mapping}
The NFS gateway in this release uses AUTH_UNIX style authentication. When the user on the NFS client
accesses the mount point, the NFS client passes the UID to the NFS gateway.
The NFS gateway does a lookup to find the user name from the UID, and then passes the
username to HDFS along with the HDFS requests.
For example, if the current user on the NFS client is "admin", then when the user accesses
the mounted directory, the NFS gateway will access HDFS as the user "admin". To access HDFS
as the user "hdfs", one needs to switch the current user to "hdfs" on the client system
when accessing the mounted directory.
The system administrator must ensure that the user on NFS client host has the same
name and UID as that on the NFS gateway host. This is usually not a problem if
the same user management system (e.g., LDAP/NIS) is used to create and deploy users on
HDFS nodes and NFS client node. In case the user account is created manually on different hosts, one might need to
modify UID (e.g., do "usermod -u 123 myusername") on either NFS client or NFS gateway host
in order to make it the same on both sides. More technical details of RPC AUTH_UNIX can be found
in {{{http://tools.ietf.org/html/rfc1057}RPC specification}}.
Optionally, the system administrator can configure a custom static mapping
file in the event one wishes to access the HDFS NFS Gateway from a system with
a completely disparate set of UIDs/GIDs. By default this file is located at
"/etc/nfs.map", but a custom location can be configured by setting the
"static.id.mapping.file" property to the path of the static mapping file.
The format of the static mapping file is similar to what is described in the
exports(5) manual page, but roughly it is:
-------------------------
# Mapping for clients accessing the NFS gateway
uid 10 100 # Map the remote UID 10 to the local UID 100
gid 11 101 # Map the remote GID 11 to the local GID 101
-------------------------
@ -1,438 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
HDFS Permissions Guide
---
---
${maven.build.timestamp}
HDFS Permissions Guide
%{toc|section=1|fromDepth=0}
* Overview
The Hadoop Distributed File System (HDFS) implements a permissions
model for files and directories that shares much of the POSIX model.
Each file and directory is associated with an owner and a group. The
file or directory has separate permissions for the user that is the
owner, for other users that are members of the group, and for all other
users. For files, the r permission is required to read the file, and
the w permission is required to write or append to the file. For
directories, the r permission is required to list the contents of the
directory, the w permission is required to create or delete files or
directories, and the x permission is required to access a child of the
directory.
In contrast to the POSIX model, there are no setuid or setgid bits for
files as there is no notion of executable files. For directories, there
are no setuid or setgid bits as a simplification. The Sticky
bit can be set on directories, preventing anyone except the superuser,
directory owner or file owner from deleting or moving the files within
the directory. Setting the sticky bit for a file has no effect.
Collectively, the permissions of a file or directory are its mode. In
general, Unix customs for representing and displaying modes will be
used, including the use of octal numbers in this description. When a
file or directory is created, its owner is the user identity of the
client process, and its group is the group of the parent directory (the
BSD rule).
HDFS also provides optional support for POSIX ACLs (Access Control Lists) to
augment file permissions with finer-grained rules for specific named users or
named groups. ACLs are discussed in greater detail later in this document.
Each client process that accesses HDFS has a two-part identity composed
of the user name and the groups list. Whenever HDFS must do a permissions
check for a file or directory foo accessed by a client process,
* If the user name matches the owner of foo, then the owner
permissions are tested;
* Else if the group of foo matches any member of the groups list,
then the group permissions are tested;
* Otherwise the other permissions of foo are tested.
If a permissions check fails, the client operation fails.
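From a client's point of view, a failed check surfaces as an exception. A minimal sketch, assuming a hypothetical path that the caller lacks read access to:

----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.AccessControlException;

public class PermissionCheckDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    try {
      fs.open(new Path("/private/secret.txt")).close();
    } catch (AccessControlException e) {
      // Thrown when the owner/group/other (or ACL) check along the path denies access.
      System.err.println("Permission denied: " + e.getMessage());
    }
  }
}
----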
* User Identity
As of Hadoop 0.22, Hadoop supports two different modes of operation to
determine the user's identity, specified by the
hadoop.security.authentication property:
* <<simple>>
In this mode of operation, the identity of a client process is
determined by the host operating system. On Unix-like systems,
the user name is the equivalent of `whoami`.
* <<kerberos>>
In Kerberized operation, the identity of a client process is
determined by its Kerberos credentials. For example, in a
Kerberized environment, a user may use the kinit utility to
obtain a Kerberos ticket-granting-ticket (TGT) and use klist to
determine their current principal. When mapping a Kerberos
principal to an HDFS username, all components except for the
primary are dropped. For example, a principal
todd/foobar@CORP.COMPANY.COM will act as the simple username
todd on HDFS.
Regardless of the mode of operation, the user identity mechanism is
extrinsic to HDFS itself. There is no provision within HDFS for
creating user identities, establishing groups, or processing user
credentials.
* Group Mapping
Once a username has been determined as described above, the list of
groups is determined by a group mapping service, configured by the
hadoop.security.group.mapping property. The default implementation,
org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback,
will determine if the Java Native Interface (JNI) is available. If
JNI is available, the implementation will use the API within hadoop
to resolve a list of groups for a user. If JNI is not available
then the shell implementation,
org.apache.hadoop.security.ShellBasedUnixGroupsMapping, is used.
This implementation shells out with the <<<bash -c groups>>>
command (for a Linux/Unix environment) or the <<<net group>>>
command (for a Windows environment) to resolve a list of groups for
a user.
An alternate implementation, which connects directly to an LDAP server
to resolve the list of groups, is available via
org.apache.hadoop.security.LdapGroupsMapping. However, this provider
should only be used if the required groups reside exclusively in LDAP,
and are not materialized on the Unix servers. More information on
configuring the group mapping service is available in the Javadocs.
For HDFS, the mapping of users to groups is performed on the NameNode.
Thus, the host system configuration of the NameNode determines the
group mappings for the users.
Note that HDFS stores the user and group of a file or directory as
strings; there is no conversion from user and group identity numbers as
is conventional in Unix.
* Understanding the Implementation
Each file or directory operation passes the full path name to the name
node, and the permissions checks are applied along the path for each
operation. The client framework will implicitly associate the user
identity with the connection to the name node, reducing the need for
changes to the existing client API. It has always been the case that
when one operation on a file succeeds, the operation might fail when
repeated because the file, or some directory on the path, no longer
exists. For instance, when the client first begins reading a file, it
makes a first request to the name node to discover the location of the
first blocks of the file. A second request made to find additional
blocks may fail. On the other hand, deleting a file does not revoke
access by a client that already knows the blocks of the file. With the
addition of permissions, a client's access to a file may be withdrawn
between requests. Again, changing permissions does not revoke the
access of a client that already knows the file's blocks.
* Changes to the File System API
All methods that use a path parameter will throw <<<AccessControlException>>>
if permission checking fails.
New methods:
* <<<public FSDataOutputStream create(Path f, FsPermission permission,
boolean overwrite, int bufferSize, short replication, long
blockSize, Progressable progress) throws IOException;>>>
* <<<public boolean mkdirs(Path f, FsPermission permission) throws
IOException;>>>
* <<<public void setPermission(Path p, FsPermission permission) throws
IOException;>>>
* <<<public void setOwner(Path p, String username, String groupname)
throws IOException;>>>
* <<<public FileStatus getFileStatus(Path f) throws IOException;>>>
will additionally return the user, group and mode associated with the
path.
The mode of a new file or directory is restricted by the umask set as a
configuration parameter. When the existing <<<create(path, …)>>> method
(without the permission parameter) is used, the mode of the new file is
<<<0666 & ^umask>>>. When the new <<<create(path, permission, …)>>> method
(with the permission parameter P) is used, the mode of the new file is
<<<P & ^umask & 0666>>>. When a new directory is created with the existing
<<<mkdirs(path)>>>
method (without the permission parameter), the mode of the new
directory is <<<0777 & ^umask>>>. When the new <<<mkdirs(path, permission)>>>
method (with the permission parameter P) is used, the mode of new
directory is <<<P & ^umask & 0777>>>.
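As a concrete illustration of these rules, the hedged sketch below creates a directory and a file with explicit <<<FsPermission>>> values; the effective modes on HDFS are the requested permissions masked as described above (the paths are hypothetical):

----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.util.Progressable;

public class PermissionedCreate {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // mkdirs(path, P): resulting mode is P & ^umask & 0777.
    fs.mkdirs(new Path("/foodir/subdir"), new FsPermission((short) 0750));

    // create(path, P, ...): resulting mode is P & ^umask & 0666.
    try (FSDataOutputStream out = fs.create(new Path("/foodir/subdir/data.txt"),
        new FsPermission((short) 0640), true, 4096, (short) 3, 128L * 1024 * 1024,
        (Progressable) null)) {
      out.writeBytes("example\n");
    }
  }
}
----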
* Changes to the Application Shell
New operations:
* <<<chmod [-R] mode file …>>>
Only the owner of a file or the super-user is permitted to change
the mode of a file.
* <<<chgrp [-R] group file …>>>
The user invoking chgrp must belong to the specified group and be
the owner of the file, or be the super-user.
* <<<chown [-R] [owner][:[group]] file …>>>
The owner of a file may only be altered by a super-user.
* <<<ls file …>>>
* <<<lsr file …>>>
The output is reformatted to display the owner, group and mode.
* The Super-User
The super-user is the user with the same identity as name node process
itself. Loosely, if you started the name node, then you are the
super-user. The super-user can do anything in that permissions checks
never fail for the super-user. There is no persistent notion of who was
the super-user; when the name node is started the process identity
determines who is the super-user for now. The HDFS super-user does not
have to be the super-user of the name node host, nor is it necessary
that all clusters have the same super-user. Also, an experimenter
running HDFS on a personal workstation, conveniently becomes that
installation's super-user without any configuration.
In addition, the administrator may identify a distinguished group using
a configuration parameter. If set, members of this group are also
super-users.
* The Web Server
By default, the identity of the web server is a configuration
parameter. That is, the name node has no notion of the identity of the
real user, but the web server behaves as if it has the identity (user
and groups) of a user chosen by the administrator. Unless the chosen
identity matches the super-user, parts of the name space may be
inaccessible to the web server.
* ACLs (Access Control Lists)
In addition to the traditional POSIX permissions model, HDFS also supports
POSIX ACLs (Access Control Lists). ACLs are useful for implementing
permission requirements that differ from the natural organizational hierarchy
of users and groups. An ACL provides a way to set different permissions for
specific named users or named groups, not only the file's owner and the
file's group.
By default, support for ACLs is disabled, and the NameNode disallows creation
of ACLs. To enable support for ACLs, set <<<dfs.namenode.acls.enabled>>> to
true in the NameNode configuration.
An ACL consists of a set of ACL entries. Each ACL entry names a specific
user or group and grants or denies read, write and execute permissions for
that specific user or group. For example:
+--
user::rw-
user:bruce:rwx #effective:r--
group::r-x #effective:r--
group:sales:rwx #effective:r--
mask::r--
other::r--
+--
ACL entries consist of a type, an optional name and a permission string.
For display purposes, ':' is used as the delimiter between each field. In
this example ACL, the file owner has read-write access, the file group has
read-execute access and others have read access. So far, this is equivalent
to setting the file's permission bits to 654.
Additionally, there are 2 extended ACL entries for the named user bruce and
the named group sales, both granted full access. The mask is a special ACL
entry that filters the permissions granted to all named user entries and
named group entries, and also the unnamed group entry. In the example, the
mask has only read permissions, and we can see that the effective permissions
of several ACL entries have been filtered accordingly.
Every ACL must have a mask. If the user doesn't supply a mask while setting
an ACL, then a mask is inserted automatically by calculating the union of
permissions on all entries that would be filtered by the mask.
Running <<<chmod>>> on a file that has an ACL actually changes the
permissions of the mask. Since the mask acts as a filter, this effectively
constrains the permissions of all extended ACL entries instead of changing
just the group entry and possibly missing other extended ACL entries.
The model also differentiates between an "access ACL", which defines the
rules to enforce during permission checks, and a "default ACL", which defines
the ACL entries that new child files or sub-directories receive automatically
during creation. For example:
+--
user::rwx
group::r-x
other::r-x
default:user::rwx
default:user:bruce:rwx #effective:r-x
default:group::r-x
default:group:sales:rwx #effective:r-x
default:mask::r-x
default:other::r-x
+--
Only directories may have a default ACL. When a new file or sub-directory is
created, it automatically copies the default ACL of its parent into its own
access ACL. A new sub-directory also copies it to its own default ACL. In
this way, the default ACL will be copied down through arbitrarily deep levels
of the file system tree as new sub-directories get created.
The exact permission values in the new child's access ACL are subject to
filtering by the mode parameter. Considering the default umask of 022, this
is typically 755 for new directories and 644 for new files. The mode
parameter filters the copied permission values for the unnamed user (file
owner), the mask and other. Using this particular example ACL, and creating
a new sub-directory with 755 for the mode, this mode filtering has no effect
on the final result. However, if we consider creation of a file with 644 for
the mode, then mode filtering causes the new file's ACL to receive read-write
for the unnamed user (file owner), read for the mask and read for others.
This mask also means that effective permissions for named user bruce and
named group sales are only read.
Note that the copy occurs at time of creation of the new file or
sub-directory. Subsequent changes to the parent's default ACL do not change
existing children.
The default ACL must have all minimum required ACL entries, including the
unnamed user (file owner), unnamed group (file group) and other entries. If
the user doesn't supply one of these entries while setting a default ACL,
then the entries are inserted automatically by copying the corresponding
permissions from the access ACL, or permission bits if there is no access
ACL. The default ACL must also have a mask. As described above, if the mask
is unspecified, then a mask is inserted automatically by calculating the
union of permissions on all entries that would be filtered by the mask.
When considering a file that has an ACL, the algorithm for permission checks
changes to:
* If the user name matches the owner of the file, then the owner
permissions are tested;
* Else if the user name matches the name in one of the named user entries,
then these permissions are tested, filtered by the mask permissions;
* Else if the group of the file matches any member of the groups list,
and if these permissions filtered by the mask grant access, then these
permissions are used;
* Else if there is a named group entry matching a member of the groups list,
and if these permissions filtered by the mask grant access, then these
permissions are used;
* Else if the file group or any named group entry matches a member of the
groups list, but access was not granted by any of those permissions, then
access is denied;
* Otherwise the other permissions of the file are tested.
Best practice is to rely on traditional permission bits to implement most
permission requirements, and define a smaller number of ACLs to augment the
permission bits with a few exceptional rules. A file with an ACL incurs an
additional cost in memory in the NameNode compared to a file that has only
permission bits.
* ACLs File System API
New methods:
* <<<public void modifyAclEntries(Path path, List<AclEntry> aclSpec) throws
IOException;>>>
* <<<public void removeAclEntries(Path path, List<AclEntry> aclSpec) throws
IOException;>>>
* <<<public void removeDefaultAcl(Path path) throws
IOException;>>>
* <<<public void removeAcl(Path path) throws IOException;>>>
* <<<public void setAcl(Path path, List<AclEntry> aclSpec) throws
IOException;>>>
* <<<public AclStatus getAclStatus(Path path) throws IOException;>>>
* ACLs Shell Commands
* <<<hdfs dfs -getfacl [-R] <path> >>>
Displays the Access Control Lists (ACLs) of files and directories. If a
directory has a default ACL, then getfacl also displays the default ACL.
* <<<hdfs dfs -setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>] >>>
Sets Access Control Lists (ACLs) of files and directories.
* <<<hdfs dfs -ls <args> >>>
The output of <<<ls>>> will append a '+' character to the permissions
string of any file or directory that has an ACL.
See the {{{../hadoop-common/FileSystemShell.html}File System Shell}}
documentation for full coverage of these commands.
* Configuration Parameters
* <<<dfs.permissions.enabled = true>>>
If true, the permissions system is used as described here. If false,
permission checking is turned off, but all other behavior is
unchanged. Switching from one parameter value to the other does not
change the mode, owner or group of files or directories.
Regardless of whether permissions are on or off, chmod, chgrp, chown and
setfacl always check permissions. These functions are only useful in
the permissions context, and so there is no backwards compatibility
issue. Furthermore, this allows administrators to reliably set
owners and permissions in advance of turning on regular permissions
checking.
* <<<dfs.web.ugi = webuser,webgroup>>>
The user name to be used by the web server. Setting this to the
name of the super-user allows any web client to see everything.
Changing this to an otherwise unused identity allows web clients to
see only those things visible using "other" permissions. Additional
groups may be added to the comma-separated list.
* <<<dfs.permissions.superusergroup = supergroup>>>
The name of the group of super-users.
* <<<fs.permissions.umask-mode = 0022>>>
The umask used when creating files and directories. For
configuration files, the decimal value 18 (equal to octal 022) may be used.
* <<<dfs.cluster.administrators = ACL-for-admins>>>
The administrators for the cluster specified as an ACL. This
controls who can access the default servlets, etc. in the HDFS.
* <<<dfs.namenode.acls.enabled = true>>>
Set to true to enable support for HDFS ACLs (Access Control Lists). By
default, ACLs are disabled. When ACLs are disabled, the NameNode rejects
all attempts to set an ACL.
@ -1,116 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
HDFS Quotas Guide
---
---
${maven.build.timestamp}
HDFS Quotas Guide
%{toc|section=1|fromDepth=0}
* Overview
The Hadoop Distributed File System (HDFS) allows the administrator to
set quotas for the number of names used and the amount of space used
for individual directories. Name quotas and space quotas operate
independently, but the administration and implementation of the two
types of quotas are closely parallel.
* Name Quotas
The name quota is a hard limit on the number of file and directory
names in the tree rooted at that directory. File and directory
creations fail if the quota would be exceeded. Quotas stick with
renamed directories; the rename operation fails if the operation would
result in a quota violation. The attempt to set a quota will still
succeed even if the directory would be in violation of the new quota. A
newly created directory has no associated quota. The largest quota is
Long.Max_Value. A quota of one forces a directory to remain empty.
(Yes, a directory counts against its own quota!)
Quotas are persistent with the fsimage. When starting, if the fsimage
is immediately in violation of a quota (perhaps the fsimage was
surreptitiously modified), a warning is printed for each such
violation. Setting or removing a quota creates a journal entry.
* Space Quotas
The space quota is a hard limit on the number of bytes used by files in
the tree rooted at that directory. Block allocations fail if the quota
would not allow a full block to be written. Each replica of a block
counts against the quota. Quotas stick with renamed directories; the
rename operation fails if the operation would result in a quota
violation. A newly created directory has no associated quota. The
largest quota is <<<Long.Max_Value>>>. A quota of zero still permits files
to be created, but no blocks can be added to the files. Directories don't
use host file system space and don't count against the space quota. The
host file system space used to save the file meta data is not counted
against the quota. Quotas are charged at the intended replication
factor for the file; changing the replication factor for a file will
credit or debit quotas.
Quotas are persistent with the fsimage. When starting, if the fsimage
is immediately in violation of a quota (perhaps the fsimage was
surreptitiously modified), a warning is printed for each such
violation. Setting or removing a quota creates a journal entry.
* Administrative Commands
Quotas are managed by a set of commands available only to the
administrator.
* <<<hdfs dfsadmin -setQuota <N> <directory>...<directory> >>>
Set the name quota to be N for each directory. Best effort for each
directory, with faults reported if N is not a positive long
integer, the directory does not exist or it is a file, or the
directory would immediately exceed the new quota.
* <<<hdfs dfsadmin -clrQuota <directory>...<directory> >>>
Remove any name quota for each directory. Best effort for each
directory, with faults reported if the directory does not exist or
it is a file. It is not a fault if the directory has no quota.
* <<<hdfs dfsadmin -setSpaceQuota <N> <directory>...<directory> >>>
Set the space quota to be N bytes for each directory. This is a
hard limit on total size of all the files under the directory tree.
The space quota takes replication also into account, i.e. one GB of
data with replication of 3 consumes 3GB of quota. N can also be
specified with a binary prefix for convenience, e.g. 50g for 50
gigabytes and 2t for 2 terabytes etc. Best effort for each
directory, with faults reported if N is neither zero nor a positive
integer, the directory does not exist or it is a file, or the
directory would immediately exceed the new quota.
* <<<hdfs dfsadmin -clrSpaceQuota <directory>...<directory> >>>
Remove any space quota for each directory. Best effort for each
directory, with faults reported if the directory does not exist or
it is a file. It is not a fault if the directory has no quota.
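As a short illustration (the directory name is hypothetical):
----
# Limit /user/alice to 100000 names and one terabyte of raw, replicated space
hdfs dfsadmin -setQuota 100000 /user/alice
hdfs dfsadmin -setSpaceQuota 1t /user/alice

# Remove both quotas again
hdfs dfsadmin -clrQuota /user/alice
hdfs dfsadmin -clrSpaceQuota /user/alice
----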
* Reporting Command
An extension to the count command of the HDFS shell reports quota
values and the current count of names and bytes in use.
* <<<hadoop fs -count -q <directory>...<directory> >>>
With the -q option, also report the name quota value set for each
directory, the available name quota remaining, the space quota
value set, and the available space quota remaining. If the
directory does not have a quota set, the reported values are <<<none>>>
and <<<inf>>>.
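For example, continuing the hypothetical directory above (the exact column layout of the output may differ between releases, so consult the shell documentation for your version):
----
# Reports name quota, remaining name quota, space quota, remaining space
# quota, followed by the directory count, file count, content size and path
hadoop fs -count -q /user/alice
----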
@ -1,556 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
HDFS Users Guide
---
---
${maven.build.timestamp}
HDFS Users Guide
%{toc|section=1|fromDepth=0}
* Purpose
This document is a starting point for users working with Hadoop
Distributed File System (HDFS) either as a part of a Hadoop cluster or
as a stand-alone general purpose distributed file system. While HDFS is
designed to "just work" in many environments, a working knowledge of
HDFS helps greatly with configuration improvements and diagnostics on a
specific cluster.
* Overview
HDFS is the primary distributed storage used by Hadoop applications. An
HDFS cluster primarily consists of a NameNode that manages the file
system metadata and DataNodes that store the actual data. The HDFS
Architecture Guide describes HDFS in detail. This user guide primarily
deals with the interaction of users and administrators with HDFS
clusters. The HDFS architecture diagram depicts basic interactions
among NameNode, the DataNodes, and the clients. Clients contact
NameNode for file metadata or file modifications and perform actual
file I/O directly with the DataNodes.
The following are some of the salient features that could be of
interest to many users.
* Hadoop, including HDFS, is well suited for distributed storage and
distributed processing using commodity hardware. It is fault
tolerant, scalable, and extremely simple to expand. MapReduce, well
known for its simplicity and applicability to a large set of
distributed applications, is an integral part of Hadoop.
* HDFS is highly configurable with a default configuration well
suited for many installations. Most of the time, configuration
needs to be tuned only for very large clusters.
* Hadoop is written in Java and is supported on all major platforms.
* Hadoop supports shell-like commands to interact with HDFS directly.
* The NameNode and DataNodes have built-in web servers that make it
easy to check the current status of the cluster.
* New features and improvements are regularly implemented in HDFS.
The following is a subset of useful features in HDFS:
* File permissions and authentication.
* Rack awareness: to take a node's physical location into
account while scheduling tasks and allocating storage.
* Safemode: an administrative mode for maintenance.
* <<<fsck>>>: a utility to diagnose health of the file system, to find
missing files or blocks.
* <<<fetchdt>>>: a utility to fetch DelegationToken and store it in a
file on the local system.
* Balancer: tool to balance the cluster when the data is
unevenly distributed among DataNodes.
* Upgrade and rollback: after a software upgrade, it is possible
to rollback to HDFS' state before the upgrade in case of
unexpected problems.
* Secondary NameNode: performs periodic checkpoints of the
namespace and helps keep the size of file containing log of
HDFS modifications within certain limits at the NameNode.
* Checkpoint node: performs periodic checkpoints of the
namespace and helps minimize the size of the log stored at the
NameNode containing changes to the HDFS. Replaces the role
previously filled by the Secondary NameNode, though is not yet
battle hardened. The NameNode allows multiple Checkpoint nodes
simultaneously, as long as there are no Backup nodes
registered with the system.
* Backup node: An extension to the Checkpoint node. In addition
to checkpointing it also receives a stream of edits from the
NameNode and maintains its own in-memory copy of the
namespace, which is always in sync with the active NameNode
namespace state. Only one Backup node may be registered with
the NameNode at once.
* Prerequisites
The following documents describe how to install and set up a Hadoop
cluster:
* {{{../hadoop-common/SingleCluster.html}Single Node Setup}}
for first-time users.
* {{{../hadoop-common/ClusterSetup.html}Cluster Setup}}
for large, distributed clusters.
The rest of this document assumes the user is able to set up and run a
HDFS with at least one DataNode. For the purpose of this document, both
the NameNode and DataNode could be running on the same physical
machine.
* Web Interface
NameNode and DataNode each run an internal web server in order to
display basic information about the current status of the cluster. With
the default configuration, the NameNode front page is at
<<<http://namenode-name:50070/>>>. It lists the DataNodes in the cluster and
basic statistics of the cluster. The web interface can also be used to
browse the file system (using "Browse the file system" link on the
NameNode front page).
* Shell Commands
Hadoop includes various shell-like commands that directly interact with
HDFS and other file systems that Hadoop supports. The command <<<bin/hdfs dfs -help>>>
lists the commands supported by Hadoop shell. Furthermore,
the command <<<bin/hdfs dfs -help command-name>>> displays more detailed help
for a command. These commands support most of the normal file system
operations, such as copying files and changing file permissions. They also
support a few HDFS-specific operations, such as changing the replication of
files. For more information see {{{../hadoop-common/FileSystemShell.html}
File System Shell Guide}}.
** DFSAdmin Command
The <<<bin/hdfs dfsadmin>>> command supports a few HDFS administration
related operations. The <<<bin/hdfs dfsadmin -help>>> command lists all the
commands currently supported. For example:
* <<<-report>>>: reports basic statistics of HDFS. Some of this
information is also available on the NameNode front page.
* <<<-safemode>>>: though usually not required, an administrator can
manually enter or leave Safemode.
* <<<-finalizeUpgrade>>>: removes previous backup of the cluster made
during last upgrade.
* <<<-refreshNodes>>>: Updates the namenode with the set of datanodes
allowed to connect to the namenode. The namenode re-reads datanode
hostnames from the files defined by <<<dfs.hosts>>> and <<<dfs.hosts.exclude>>>.
Hosts defined in <<<dfs.hosts>>> are the datanodes that are part of the
cluster. If there are entries in <<<dfs.hosts>>>, only the hosts in it
are allowed to register with the namenode. Entries in
<<<dfs.hosts.exclude>>> are datanodes that need to be decommissioned.
Datanodes complete decommissioning when all the replicas from them
are replicated to other datanodes. Decommissioned nodes are not
automatically shut down and are not chosen as targets for new
replicas.
* <<<-printTopology>>> : Print the topology of the cluster. Display a tree
of racks and the datanodes attached to those racks, as viewed by the
NameNode.
For command usage, see {{{./HDFSCommands.html#dfsadmin}dfsadmin}}.
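For example:
----
# Basic capacity and DataNode statistics, similar to the NameNode front page
hdfs dfsadmin -report

# Re-read the dfs.hosts / dfs.hosts.exclude files after editing them
hdfs dfsadmin -refreshNodes

# Show the rack/datanode tree as seen by the NameNode
hdfs dfsadmin -printTopology
----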
* Secondary NameNode
The NameNode stores modifications to the file system as a log appended
to a native file system file, edits. When a NameNode starts up, it
reads HDFS state from an image file, fsimage, and then applies edits
from the edits log file. It then writes new HDFS state to the fsimage
and starts normal operation with an empty edits file. Since NameNode
merges fsimage and edits files only during start up, the edits log file
could get very large over time on a busy cluster. Another side effect
of a larger edits file is that next restart of NameNode takes longer.
The secondary NameNode merges the fsimage and the edits log files
periodically and keeps edits log size within a limit. It is usually run
on a different machine than the primary NameNode since its memory
requirements are on the same order as the primary NameNode.
The start of the checkpoint process on the secondary NameNode is
controlled by two configuration parameters.
* <<<dfs.namenode.checkpoint.period>>>, set to 1 hour by default, specifies
the maximum delay between two consecutive checkpoints, and
* <<<dfs.namenode.checkpoint.txns>>>, set to 1 million by default, defines the
number of uncheckpointed transactions on the NameNode which will
force an urgent checkpoint, even if the checkpoint period has not
been reached.
The secondary NameNode stores the latest checkpoint in a directory
which is structured the same way as the primary NameNode's directory,
so that the checkpointed image is always ready to be read by the
primary NameNode if necessary.
For command usage,
see {{{./HDFSCommands.html#secondarynamenode}secondarynamenode}}.
* Checkpoint Node
NameNode persists its namespace using two files: fsimage, which is the
latest checkpoint of the namespace and edits, a journal (log) of
changes to the namespace since the checkpoint. When a NameNode starts
up, it merges the fsimage and edits journal to provide an up-to-date
view of the file system metadata. The NameNode then overwrites fsimage
with the new HDFS state and begins a new edits journal.
The Checkpoint node periodically creates checkpoints of the namespace.
It downloads fsimage and edits from the active NameNode, merges them
locally, and uploads the new image back to the active NameNode. The
Checkpoint node usually runs on a different machine than the NameNode
since its memory requirements are on the same order as the NameNode.
The Checkpoint node is started by <<<bin/hdfs namenode -checkpoint>>> on the
node specified in the configuration file.
The location of the Checkpoint (or Backup) node and its accompanying
web interface are configured via the <<<dfs.namenode.backup.address>>> and
<<<dfs.namenode.backup.http-address>>> configuration variables.
The start of the checkpoint process on the Checkpoint node is
controlled by two configuration parameters.
* <<<dfs.namenode.checkpoint.period>>>, set to 1 hour by default, specifies
the maximum delay between two consecutive checkpoints
* <<<dfs.namenode.checkpoint.txns>>>, set to 1 million by default, defines the
number of uncheckpointed transactions on the NameNode which will
force an urgent checkpoint, even if the checkpoint period has not
been reached.
The Checkpoint node stores the latest checkpoint in a directory that is
structured the same as the NameNode's directory. This allows the
checkpointed image to be always available for reading by the NameNode
if necessary. See Import checkpoint.
Multiple checkpoint nodes may be specified in the cluster configuration
file.
For command usage, see {{{./HDFSCommands.html#namenode}namenode}}.
* Backup Node
The Backup node provides the same checkpointing functionality as the
Checkpoint node, as well as maintaining an in-memory, up-to-date copy
of the file system namespace that is always synchronized with the
active NameNode state. Along with accepting a journal stream of file
system edits from the NameNode and persisting this to disk, the Backup
node also applies those edits into its own copy of the namespace in
memory, thus creating a backup of the namespace.
The Backup node does not need to download fsimage and edits files from
the active NameNode in order to create a checkpoint, as would be
required with a Checkpoint node or Secondary NameNode, since it already
has an up-to-date state of the namespace in memory. The Backup
node checkpoint process is more efficient as it only needs to save the
namespace into the local fsimage file and reset edits.
As the Backup node maintains a copy of the namespace in memory, its RAM
requirements are the same as those of the NameNode.
The NameNode supports one Backup node at a time. No Checkpoint nodes
may be registered if a Backup node is in use. Using multiple Backup
nodes concurrently will be supported in the future.
The Backup node is configured in the same manner as the Checkpoint
node. It is started with <<<bin/hdfs namenode -backup>>>.
The location of the Backup (or Checkpoint) node and its accompanying
web interface are configured via the <<<dfs.namenode.backup.address>>> and
<<<dfs.namenode.backup.http-address>>> configuration variables.
Use of a Backup node provides the option of running the NameNode with
no persistent storage, delegating all responsibility for persisting the
state of the namespace to the Backup node. To do this, start the
NameNode with the <<<-importCheckpoint>>> option, along with specifying no
persistent storage directories of type edits <<<dfs.namenode.edits.dir>>> for
the NameNode configuration.
For a complete discussion of the motivation behind the creation of the
Backup node and Checkpoint node, see {{{https://issues.apache.org/jira/browse/HADOOP-4539}HADOOP-4539}}.
For command usage, see {{{./HDFSCommands.html#namenode}namenode}}.
* Import Checkpoint
The latest checkpoint can be imported to the NameNode if all other
copies of the image and the edits files are lost. In order to do that
one should:
* Create an empty directory specified in the <<<dfs.namenode.name.dir>>>
configuration variable;
* Specify the location of the checkpoint directory in the
configuration variable <<<dfs.namenode.checkpoint.dir>>>;
* and start the NameNode with <<<-importCheckpoint>>> option.
The NameNode will load the checkpoint from the
<<<dfs.namenode.checkpoint.dir>>> directory and then save it to the NameNode
directory(s) set in <<<dfs.namenode.name.dir>>>. The NameNode will fail if a
legal image is contained in <<<dfs.namenode.name.dir>>>. The NameNode
verifies that the image in <<<dfs.namenode.checkpoint.dir>>> is consistent,
but does not modify it in any way.
For command usage, see {{{./HDFSCommands.html#namenode}namenode}}.
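Sketched as shell steps (the local directory is hypothetical; in practice both locations are set in <<<hdfs-site.xml>>> via <<<dfs.namenode.name.dir>>> and <<<dfs.namenode.checkpoint.dir>>>):
----
# 1. Provide an empty directory at the location named by dfs.namenode.name.dir
mkdir -p /data/hdfs/name

# 2. With dfs.namenode.checkpoint.dir pointing at the saved checkpoint,
#    start the NameNode with the import option
hdfs namenode -importCheckpoint
----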
* Balancer
HDFS data might not always be placed uniformly across the DataNodes.
One common reason is addition of new DataNodes to an existing cluster.
While placing new blocks (data for a file is stored as a series of
blocks), NameNode considers various parameters before choosing the
DataNodes to receive these blocks. Some of the considerations are:
* Policy to keep one of the replicas of a block on the same node as
the node that is writing the block.
* Need to spread different replicas of a block across the racks so
that cluster can survive loss of whole rack.
* One of the replicas is usually placed on the same rack as the node
writing to the file so that cross-rack network I/O is reduced.
* Spread HDFS data uniformly across the DataNodes in the cluster.
Due to multiple competing considerations, data might not be uniformly
placed across the DataNodes. HDFS provides a tool for administrators
that analyzes block placement and rebalances data across the DataNodes.
A brief administrator's guide for balancer is available at
{{{https://issues.apache.org/jira/browse/HADOOP-1652}HADOOP-1652}}.
For command usage, see {{{./HDFSCommands.html#balancer}balancer}}.
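For example, assuming the default settings, the following runs the balancer until every DataNode's utilization is within 5 percent of the cluster average:
----
hdfs balancer -threshold 5
----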
* Rack Awareness
Typically large Hadoop clusters are arranged in racks and network
traffic between different nodes within the same rack is much more
desirable than network traffic across the racks. In addition, the NameNode
tries to place replicas of a block on multiple racks for improved fault
tolerance. Hadoop lets the cluster administrators decide which rack a
node belongs to through configuration variable
<<<net.topology.script.file.name>>>. When this script is configured, each
node runs the script to determine its rack id. A default installation
assumes all the nodes belong to the same rack. This feature and
configuration is further described in PDF attached to
{{{https://issues.apache.org/jira/browse/HADOOP-692}HADOOP-692}}.
* Safemode
During start up the NameNode loads the file system state from the
fsimage and the edits log file. It then waits for DataNodes to report
their blocks so that it does not prematurely start replicating the
blocks even though enough replicas already exist in the cluster. During this
time NameNode stays in Safemode. Safemode for the NameNode is
essentially a read-only mode for the HDFS cluster, where it does not
allow any modifications to file system or blocks. Normally the NameNode
leaves Safemode automatically after the DataNodes have reported that
most file system blocks are available. If required, HDFS could be
placed in Safemode explicitly using <<<bin/hdfs dfsadmin -safemode>>>
command. NameNode front page shows whether Safemode is on or off. A
more detailed description and configuration is maintained as JavaDoc
for <<<setSafeMode()>>>.
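For example:
----
# Query, enter and leave Safemode explicitly
hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave

# Block until the NameNode has left Safemode; handy in startup scripts
hdfs dfsadmin -safemode wait
----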
* fsck
HDFS supports the fsck command to check for various inconsistencies. It
is designed for reporting problems with various files, for example,
missing blocks for a file or under-replicated blocks. Unlike a
traditional fsck utility for native file systems, this command does not
correct the errors it detects. Normally NameNode automatically corrects
most of the recoverable failures. By default fsck ignores open files
but provides an option to select all files during reporting. The HDFS
fsck command is not a Hadoop shell command. It can be run as
<<<bin/hdfs fsck>>>. For command usage, see
{{{./HDFSCommands.html#fsck}fsck}}. fsck can be run on
the whole file system or on a subset of files.
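For example (the subtree path is hypothetical):
----
# Summarize the health of the whole namespace
hdfs fsck /

# Inspect a subtree in detail, including files currently open for write
hdfs fsck /user/alice -files -blocks -locations -openforwrite
----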
* fetchdt
HDFS supports the fetchdt command to fetch a Delegation Token and store
it in a file on the local system. This token can later be used to
access a secure server (the NameNode, for example) from a non-secure client.
The utility uses either RPC or HTTPS (over Kerberos) to get the token, and
thus requires Kerberos tickets to be present before the run (run kinit
to get the tickets). The HDFS fetchdt command is not a Hadoop shell
command. It can be run as <<<bin/hdfs fetchdt DTfile>>>. After you have
obtained the token you can run an HDFS command without having Kerberos
tickets, by pointing the <<<HADOOP_TOKEN_FILE_LOCATION>>> environment variable to the
delegation token file. For command usage, see
{{{./HDFSCommands.html#fetchdt}fetchdt}} command.
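A typical sequence looks like the following; the token file location is arbitrary:
----
# Obtain Kerberos credentials, then fetch a delegation token into a local file
kinit
hdfs fetchdt /tmp/alice.dt

# Later, run HDFS commands using the token instead of Kerberos tickets
export HADOOP_TOKEN_FILE_LOCATION=/tmp/alice.dt
hdfs dfs -ls /
----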
* Recovery Mode
Typically, you will configure multiple metadata storage locations.
Then, if one storage location is corrupt, you can read the metadata
from one of the other storage locations.
However, what can you do if the only storage locations available are
corrupt? In this case, there is a special NameNode startup mode called
Recovery mode that may allow you to recover most of your data.
You can start the NameNode in recovery mode like so: <<<namenode -recover>>>
When in recovery mode, the NameNode will interactively prompt you at
the command line about possible courses of action you can take to
recover your data.
If you don't want to be prompted, you can give the <<<-force>>> option. This
option will force recovery mode to always select the first choice.
Normally, this will be the most reasonable choice.
Because Recovery mode can cause you to lose data, you should always
back up your edit log and fsimage before using it.
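A cautious session might look like this (the metadata directory is hypothetical):
----
# Copy the existing metadata aside before attempting recovery
cp -r /data/hdfs/name /data/hdfs/name.bak

# Interactive recovery; add -force to always take the first (default) choice
hdfs namenode -recover
----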
* Upgrade and Rollback
When Hadoop is upgraded on an existing cluster, as with any software
upgrade, it is possible there are new bugs or incompatible changes that
affect existing applications and were not discovered earlier. In any
non-trivial HDFS installation, it is not an option to lose any data,
let alone to restart HDFS from scratch. HDFS allows administrators to
go back to earlier version of Hadoop and rollback the cluster to the
state it was in before the upgrade. HDFS upgrade is described in more
detail in {{{http://wiki.apache.org/hadoop/Hadoop_Upgrade}Hadoop Upgrade}}
Wiki page. HDFS can have one such backup at a time. Before upgrading,
administrators need to remove the existing backup using the
<<<bin/hadoop dfsadmin -finalizeUpgrade>>> command. The following briefly describes the
typical upgrade procedure:
* Before upgrading Hadoop software, finalize if there is an existing
backup. <<<dfsadmin -upgradeProgress status>>> can tell if the cluster
needs to be finalized.
* Stop the cluster and distribute new version of Hadoop.
* Run the new version with <<<-upgrade>>> option (<<<bin/start-dfs.sh -upgrade>>>).
* Most of the time, cluster works just fine. Once the new HDFS is
considered working well (may be after a few days of operation),
finalize the upgrade. Note that until the cluster is finalized,
deleting the files that existed before the upgrade does not free up
real disk space on the DataNodes.
* If there is a need to move back to the old version,
* stop the cluster and distribute earlier version of Hadoop.
* start the cluster with rollback option. (<<<bin/start-dfs.sh -rollback>>>).
When upgrading to a new version of HDFS, it is necessary to rename or
delete any paths that are reserved in the new version of HDFS. If the
NameNode encounters a reserved path during upgrade, it will print an
error like the following:
<<< /.reserved is a reserved path and .snapshot is a
reserved path component in this version of HDFS. Please rollback and delete
or rename this path, or upgrade with the -renameReserved [key-value pairs]
option to automatically rename these paths during upgrade.>>>
Specifying <<<-upgrade -renameReserved [optional key-value pairs]>>> causes
the NameNode to automatically rename any reserved paths found during
startup. For example, to rename all paths named <<<.snapshot>>> to
<<<.my-snapshot>>> and <<<.reserved>>> to <<<.my-reserved>>>, a user would
specify <<<-upgrade -renameReserved
.snapshot=.my-snapshot,.reserved=.my-reserved>>>.
If no key-value pairs are specified with <<<-renameReserved>>>, the
NameNode will then suffix reserved paths with
<<<.<LAYOUT-VERSION>.UPGRADE_RENAMED>>>, e.g.
<<<.snapshot.-51.UPGRADE_RENAMED>>>.
There are some caveats to this renaming process. It's recommended,
if possible, to first <<<hdfs dfsadmin -saveNamespace>>> before upgrading.
This is because data inconsistency can result if an edit log operation
refers to the destination of an automatically renamed file.
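Putting these steps together, a sketch of the pre-upgrade and upgrade sequence might look like the following; whether the options are passed through <<<start-dfs.sh>>> or directly to <<<hdfs namenode>>> depends on how the daemons are started in your deployment:
----
# Checkpoint the namespace before upgrading (requires Safemode)
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace

# Stop the cluster, deploy the new version, then start with the upgrade
# option, renaming any reserved paths that already exist
bin/start-dfs.sh -upgrade -renameReserved .snapshot=.my-snapshot,.reserved=.my-reserved

# Once the new version has proven itself, finalize the upgrade
hdfs dfsadmin -finalizeUpgrade
----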
* DataNode Hot Swap Drive
Datanode supports hot swappable drives. The user can add or replace HDFS data
volumes without shutting down the DataNode. The following briefly describes
the typical hot swapping drive procedure:
* If there are new storage directories, the user should format them and mount them
appropriately.
* The user updates the DataNode configuration <<<dfs.datanode.data.dir>>>
to reflect the data volume directories that will be actively in use.
* The user runs <<<dfsadmin -reconfig datanode HOST:PORT start>>> to start
the reconfiguration process. The user can use <<<dfsadmin -reconfig
datanode HOST:PORT status>>> to query the running status of the reconfiguration
task.
* Once the reconfiguration task has completed, the user can safely <<<umount>>>
the removed data volume directories and physically remove the disks.
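For example (the host name, IPC port and mount point are hypothetical):
----
# After formatting and mounting the new disk and adding it to
# dfs.datanode.data.dir in the DataNode's configuration:
hdfs dfsadmin -reconfig datanode dn1.example.com:50020 start

# Poll until the reconfiguration task reports completion
hdfs dfsadmin -reconfig datanode dn1.example.com:50020 status
----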
* File Permissions and Security
The file permissions are designed to be similar to file permissions on
other familiar platforms like Linux. Currently, security is limited to
simple file permissions. The user that starts NameNode is treated as
the superuser for HDFS. Future versions of HDFS will support network
authentication protocols like Kerberos for user authentication and
encryption of data transfers. The details are discussed in the
Permissions Guide.
* Scalability
Hadoop currently runs on clusters with thousands of nodes. The
{{{http://wiki.apache.org/hadoop/PoweredBy}PoweredBy}} Wiki page lists
some of the organizations that deploy Hadoop on large clusters.
HDFS has one NameNode for each cluster. Currently the total memory
available on NameNode is the primary scalability limitation.
On very large clusters, increasing average size of files stored in
HDFS helps with increasing cluster size without increasing memory
requirements on the NameNode. The default configuration may not suit
very large clusters. The {{{http://wiki.apache.org/hadoop/FAQ}FAQ}}
Wiki page lists suggested configuration improvements for large Hadoop clusters.
* Related Documentation
This user guide is a good starting point for working with HDFS. While
the user guide continues to improve, there is a large wealth of
documentation about Hadoop and HDFS. The following list is a starting
point for further exploration:
* {{{http://hadoop.apache.org}Hadoop Site}}: The home page for
the Apache Hadoop site.
* {{{http://wiki.apache.org/hadoop/FrontPage}Hadoop Wiki}}:
The home page (FrontPage) for the Hadoop Wiki. Unlike
the released documentation, which is part of Hadoop source tree,
Hadoop Wiki is regularly edited by Hadoop Community.
* {{{http://wiki.apache.org/hadoop/FAQ}FAQ}}: The FAQ Wiki page.
* {{{../../api/index.html}Hadoop JavaDoc API}}.
* Hadoop User Mailing List: user[at]hadoop.apache.org.
* Explore {{{./hdfs-default.xml}hdfs-default.xml}}. It includes
brief description of most of the configuration variables available.
* {{{./HDFSCommands.html}HDFS Commands Guide}}: HDFS commands usage.
@ -1,101 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
C API libhdfs
---
---
${maven.build.timestamp}
C API libhdfs
%{toc|section=1|fromDepth=0}
* Overview
libhdfs is a JNI based C API for Hadoop's Distributed File System
(HDFS). It provides C APIs to a subset of the HDFS APIs to manipulate
HDFS files and the filesystem. libhdfs is part of the Hadoop
distribution and comes pre-compiled in
<<<${HADOOP_HDFS_HOME}/lib/native/libhdfs.so>>> . libhdfs is compatible with
Windows and can be built on Windows by running <<<mvn compile>>> within the
<<<hadoop-hdfs-project/hadoop-hdfs>>> directory of the source tree.
* The APIs
The libhdfs APIs are a subset of the
{{{../../api/org/apache/hadoop/fs/FileSystem.html}Hadoop FileSystem APIs}}.
The header file for libhdfs describes each API in detail and is
available in <<<${HADOOP_HDFS_HOME}/include/hdfs.h>>>.
* A Sample Program
----
\#include "hdfs.h"
int main(int argc, char **argv) {
hdfsFS fs = hdfsConnect("default", 0);
const char* writePath = "/tmp/testfile.txt";
hdfsFile writeFile = hdfsOpenFile(fs, writePath, O_WRONLY|O_CREAT, 0, 0, 0);
if(!writeFile) {
fprintf(stderr, "Failed to open %s for writing!\n", writePath);
exit(-1);
}
char* buffer = "Hello, World!";
tSize num_written_bytes = hdfsWrite(fs, writeFile, (void*)buffer, strlen(buffer)+1);
if (hdfsFlush(fs, writeFile)) {
fprintf(stderr, "Failed to 'flush' %s\n", writePath);
exit(-1);
}
hdfsCloseFile(fs, writeFile);
}
----
* How To Link With The Library
See the CMake file for <<<test_libhdfs_ops.c>>> in the libhdfs source
directory (<<<hadoop-hdfs-project/hadoop-hdfs/src/CMakeLists.txt>>>) or
something like:
<<<gcc above_sample.c -I${HADOOP_HDFS_HOME}/include -L${HADOOP_HDFS_HOME}/lib/native -lhdfs -o above_sample>>>
* Common Problems
The most common problem is the <<<CLASSPATH>>> is not set properly when
calling a program that uses libhdfs. Make sure you set it to all the
Hadoop jars needed to run Hadoop itself as well as the right configuration
directory containing <<<hdfs-site.xml>>>. It is not valid to use wildcard
syntax for specifying multiple jars. It may be useful to run
<<<hadoop classpath --glob>>> or <<<hadoop classpath --jar <path>>>> to
generate the correct classpath for your deployment. See
{{{../hadoop-common/CommandsManual.html#classpath}Hadoop Commands Reference}}
for more information on this command.
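For example, one way to run the sample compiled with the command above (the JVM library directory varies by platform and JDK layout):
----
# Resolve classpath wildcards so libhdfs can start an embedded JVM
export CLASSPATH=$(hadoop classpath --glob)
export LD_LIBRARY_PATH=${HADOOP_HDFS_HOME}/lib/native:${JAVA_HOME}/jre/lib/amd64/server
./above_sample
----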
* Thread Safe
libhdfs is thread safe.
* Concurrency and Hadoop FS "handles"
The Hadoop FS implementation includes a FS handle cache which
caches based on the URI of the namenode along with the user
connecting. So, all calls to <<<hdfsConnect>>> will return the same
handle but calls to <<<hdfsConnectAsUser>>> with different users will
return different handles. But, since HDFS client handles are
completely thread safe, this has no bearing on concurrency.
* Concurrency and libhdfs/JNI
The libhdfs calls to JNI should always be creating thread local
storage, so (in theory), libhdfs should be as thread safe as the
underlying calls to the Hadoop FS.
@ -1,195 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Synthetic Load Generator Guide
---
---
${maven.build.timestamp}
Synthetic Load Generator Guide
%{toc|section=1|fromDepth=0}
* Overview
The synthetic load generator (SLG) is a tool for testing NameNode
behavior under different client loads. The user can generate different
mixes of read, write, and list requests by specifying the probabilities
of read and write. The user controls the intensity of the load by
adjusting parameters for the number of worker threads and the delay
between operations. While load generators are running, the user can
profile and monitor the running of the NameNode. When a load generator
exits, it prints some NameNode statistics like the average execution
time of each kind of operation and the NameNode throughput.
* Synopsis
The synopsis of the command is:
----
java LoadGenerator [options]
----
Options include:
* <<<-readProbability>>> <read probability>
The probability of the read operation; default is 0.3333.
* <<<-writeProbability>>> <write probability>
The probability of the write operations; default is 0.3333.
* <<<-root>>> <test space root>
The root of the test space; default is /testLoadSpace.
* <<<-maxDelayBetweenOps>>> <maxDelayBetweenOpsInMillis>
The maximum delay between two consecutive operations in a thread;
default is 0 indicating no delay.
* <<<-numOfThreads>>> <numOfThreads>
The number of threads to spawn; default is 200.
* <<<-elapsedTime>>> <elapsedTimeInSecs>
The number of seconds that the program will run; A value of zero
indicates that the program runs forever. The default value is 0.
* <<<-startTime>>> <startTimeInMillis>
The time that all worker threads start to run. By default it is 10
seconds after the main program starts running. This creates a
barrier if more than one load generator is running.
* <<<-seed>>> <seed>
The random generator seed for repeating requests to NameNode when
running with a single thread; default is the current time.
After command line argument parsing, the load generator traverses the
test space and builds a table of all directories and another table of
all files in the test space. It then waits until the start time to
spawn the number of worker threads as specified by the user. Each
thread sends a stream of requests to NameNode. At each iteration, it
first decides if it is going to read a file, create a file, or list a
directory following the read and write probabilities specified by the
user. The listing probability is equal to 1 - (read probability + write
probability). When reading, it randomly picks a file in the test space
and reads the entire file. When writing, it randomly picks a directory
in the test space and creates a file there.
To avoid two threads with the same load generator or from two different
load generators creating the same file, the file name consists of the
current machine's host name and the thread id. The length of the file
follows a Gaussian distribution with an average size of 2 blocks and a
standard deviation of 1. The new file is filled with byte 'a'. To avoid
the test space growing indefinitely, the file is deleted immediately
after the file creation completes. While listing, it randomly picks a
directory in the test space and lists its content.
After an operation completes, the thread pauses for a random amount of
time in the range of [0, maxDelayBetweenOps] if the specified maximum
delay is not zero. All threads are stopped when the specified elapsed
time has passed. Before exiting, the program prints the average
execution time for each kind of NameNode operation, and the number of
requests served by the NameNode per second.
* Test Space Population
The user needs to populate a test space before running a load
generator. The structure generator generates a random test space
structure and the data generator creates the files and directories of
the test space in Hadoop distributed file system.
** Structure Generator
This tool generates a random namespace structure with the following
constraints:
[[1]] The number of subdirectories that a directory can have is a random
number in [minWidth, maxWidth].
[[2]] The maximum depth of each subdirectory is a random number in
[2*maxDepth/3, maxDepth].
[[3]] Files are randomly placed in leaf directories. The size of each
file follows a Gaussian distribution with an average size of 1 block
and a standard deviation of 1.
The generated namespace structure is described by two files in the
output directory. Each line of the first file contains the full name of
a leaf directory. Each line of the second file contains the full name
of a file and its size, separated by a blank.
The synopsis of the command is:
----
java StructureGenerator [options]
----
Options include:
* <<<-maxDepth>>> <maxDepth>
Maximum depth of the directory tree; default is 5.
* <<<-minWidth>>> <minWidth>
Minimum number of subdirectories per directory; default is 1.
* <<<-maxWidth>>> <maxWidth>
Maximum number of subdirectories per directory; default is 5.
* <<<-numOfFiles>>> <#OfFiles>
The total number of files in the test space; default is 10.
* <<<-avgFileSize>>> <avgFileSizeInBlocks>
Average file size in blocks; default is 1.
* <<<-outDir>>> <outDir>
Output directory; default is the current directory.
* <<<-seed>>> <seed>
Random number generator seed; default is the current time.
** Data Generator
This tool reads the directory structure and file structure from the
input directory and creates the namespace in Hadoop distributed file
system. All files are filled with byte 'a'.
The synopsis of the command is:
----
java DataGenerator [options]
----
Options include:
* <<<-inDir>>> <inDir>
Input directory name where directory/file structures are stored;
default is the current directory.
* <<<-root>>> <test space root>
The name of the root directory which the new namespace is going to
be placed under; default is "/testLoadSpace".
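Tying the three tools together, an end-to-end run might look like the following; the option values are illustrative, and depending on your distribution the classes may need to be invoked through <<<hadoop jar>>> with the appropriate test jar on the classpath rather than with plain <<<java>>>:
----
# 1. Generate a random namespace description in ./structure
java StructureGenerator -maxDepth 5 -numOfFiles 100 -outDir structure

# 2. Materialize that namespace in HDFS under /testLoadSpace
java DataGenerator -inDir structure -root /testLoadSpace

# 3. Drive the NameNode with a mixed read/write/list workload for 300 seconds
java LoadGenerator -root /testLoadSpace -numOfThreads 100 -readProbability 0.4 -writeProbability 0.4 -elapsedTime 300
----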
@ -1,112 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Distributed File System-${project.version} - Short-Circuit Local Reads
---
---
${maven.build.timestamp}
HDFS Short-Circuit Local Reads
%{toc|section=1|fromDepth=0}
* {Short-Circuit Local Reads}
** Background
In <<<HDFS>>>, reads normally go through the <<<DataNode>>>. Thus, when the
client asks the <<<DataNode>>> to read a file, the <<<DataNode>>> reads that
file off of the disk and sends the data to the client over a TCP socket.
So-called "short-circuit" reads bypass the <<<DataNode>>>, allowing the client
to read the file directly. Obviously, this is only possible in cases where
the client is co-located with the data. Short-circuit reads provide a
substantial performance boost to many applications.
** Setup
To configure short-circuit local reads, you will need to enable
<<<libhadoop.so>>>. See
{{{../hadoop-common/NativeLibraries.html}Native
Libraries}} for details on enabling this library.
Short-circuit reads make use of a UNIX domain socket. This is a special path
in the filesystem that allows the client and the <<<DataNode>>>s to communicate.
You will need to set a path to this socket. The <<<DataNode>>> needs to be able to
create this path. On the other hand, it should not be possible for any user
except the HDFS user or root to create this path. For this reason, paths
under <<</var/run>>> or <<</var/lib>>> are often used.
The client and the <<<DataNode>>> exchange information via a shared memory segment
on <<</dev/shm>>>.
Short-circuit local reads need to be configured on both the <<<DataNode>>>
and the client.
** Example Configuration
Here is an example configuration.
----
<configuration>
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.domain.socket.path</name>
<value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
</configuration>
----
* Legacy HDFS Short-Circuit Local Reads
The legacy implementation of short-circuit local reads,
in which clients directly open the HDFS block files,
is still available for platforms other than Linux.
Setting the value of <<<dfs.client.use.legacy.blockreader.local>>>
in addition to <<<dfs.client.read.shortcircuit>>>
to true enables this feature.
You also need to set the value of <<<dfs.datanode.data.dir.perm>>>
to <<<750>>> instead of the default <<<700>>> and
chmod/chown the directory tree under <<<dfs.datanode.data.dir>>>
as readable to the client and the <<<DataNode>>>.
You must take caution because this means that
the client can read all of the block files, bypassing HDFS permissions.
Because legacy short-circuit local reads are insecure,
access to this feature is limited to the users listed in
the value of <<<dfs.block.local-path-access.user>>>.
----
<configuration>
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.client.use.legacy.blockreader.local</name>
<value>true</value>
</property>
<property>
<name>dfs.datanode.data.dir.perm</name>
<value>750</value>
</property>
<property>
<name>dfs.block.local-path-access.user</name>
<value>foo,bar</value>
</property>
</configuration>
----
@ -1,290 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Distributed File System-${project.version} - Transparent Encryption in HDFS
---
---
${maven.build.timestamp}
Transparent Encryption in HDFS
%{toc|section=1|fromDepth=2|toDepth=3}
* {Overview}
HDFS implements <transparent>, <end-to-end> encryption.
Once configured, data read from and written to special HDFS directories is <transparently> encrypted and decrypted without requiring changes to user application code.
This encryption is also <end-to-end>, which means the data can only be encrypted and decrypted by the client.
HDFS never stores or has access to unencrypted data or unencrypted data encryption keys.
This satisfies two typical requirements for encryption: <at-rest encryption> (meaning data on persistent media, such as a disk) as well as <in-transit encryption> (e.g. when data is travelling over the network).
* {Background}
Encryption can be done at different layers in a traditional data management software/hardware stack.
Choosing to encrypt at a given layer comes with different advantages and disadvantages.
* <<Application-level encryption>>. This is the most secure and most flexible approach. The application has ultimate control over what is encrypted and can precisely reflect the requirements of the user. However, writing applications to do this is hard. This is also not an option for customers of existing applications that do not support encryption.
* <<Database-level encryption>>. Similar to application-level encryption in terms of its properties. Most database vendors offer some form of encryption. However, there can be performance issues. One example is that indexes cannot be encrypted.
* <<Filesystem-level encryption>>. This option offers high performance, application transparency, and is typically easy to deploy. However, it is unable to model some application-level policies. For instance, multi-tenant applications might want to encrypt based on the end user. A database might want different encryption settings for each column stored within a single file.
* <<Disk-level encryption>>. Easy to deploy and high performance, but also quite inflexible. Only really protects against physical theft.
HDFS-level encryption fits between database-level and filesystem-level encryption in this stack. This has a lot of positive effects. HDFS encryption is able to provide good performance and existing Hadoop applications are able to run transparently on encrypted data. HDFS also has more context than traditional filesystems when it comes to making policy decisions.
HDFS-level encryption also prevents attacks at the filesystem-level and below (so-called "OS-level attacks"). The operating system and disk only interact with encrypted bytes, since the data is already encrypted by HDFS.
* {Use Cases}
Data encryption is required by a number of different government, financial, and regulatory entities.
For example, the health-care industry has HIPAA regulations, the card payment industry has PCI DSS regulations, and the US government has FISMA regulations.
Having transparent encryption built into HDFS makes it easier for organizations to comply with these regulations.
Encryption can also be performed at the application-level, but by integrating it into HDFS, existing applications can operate on encrypted data without changes.
This integrated architecture implies stronger encrypted file semantics and better coordination with other HDFS functions.
* {Architecture}
** {Overview}
For transparent encryption, we introduce a new abstraction to HDFS: the <encryption zone>.
An encryption zone is a special directory whose contents will be transparently encrypted upon write and transparently decrypted upon read.
Each encryption zone is associated with a single <encryption zone key> which is specified when the zone is created.
Each file within an encryption zone has its own unique <data encryption key (DEK)>.
DEKs are never handled directly by HDFS.
Instead, HDFS only ever handles an <encrypted data encryption key (EDEK)>.
Clients decrypt an EDEK, and then use the subsequent DEK to read and write data.
HDFS datanodes simply see a stream of encrypted bytes.
A new cluster service is required to manage encryption keys: the Hadoop Key Management Server (KMS).
In the context of HDFS encryption, the KMS performs three basic responsibilities:
[[1]] Providing access to stored encryption zone keys
[[1]] Generating new encrypted data encryption keys for storage on the NameNode
[[1]] Decrypting encrypted data encryption keys for use by HDFS clients
The KMS will be described in more detail below.
** {Accessing data within an encryption zone}
When creating a new file in an encryption zone, the NameNode asks the KMS to generate a new EDEK encrypted with the encryption zone's key.
The EDEK is then stored persistently as part of the file's metadata on the NameNode.
When reading a file within an encryption zone, the NameNode provides the client with the file's EDEK and the encryption zone key version used to encrypt the EDEK.
The client then asks the KMS to decrypt the EDEK, which involves checking that the client has permission to access the encryption zone key version.
Assuming that is successful, the client uses the DEK to decrypt the file's contents.
All of the above steps for the read and write path happen automatically through interactions between the DFSClient, the NameNode, and the KMS.
Access to encrypted file data and metadata is controlled by normal HDFS filesystem permissions.
This means that if HDFS is compromised (for example, by gaining unauthorized access to an HDFS superuser account), a malicious user only gains access to ciphertext and encrypted keys.
However, since access to encryption zone keys is controlled by a separate set of permissions on the KMS and key store, this does not pose a security threat.
** {Key Management Server, KeyProvider, EDEKs}
The KMS is a proxy that interfaces with a backing key store on behalf of HDFS daemons and clients.
Both the backing key store and the KMS implement the Hadoop KeyProvider API.
See the {{{../../hadoop-kms/index.html}KMS documentation}} for more information.
In the KeyProvider API, each encryption key has a unique <key name>.
Because keys can be rolled, a key can have multiple <key versions>, where each key version has its own <key material> (the actual secret bytes used during encryption and decryption).
An encryption key can be fetched by either its key name, returning the latest version of the key, or by a specific key version.
The KMS implements additional functionality which enables creation and decryption of <encrypted encryption keys (EEKs)>.
Creation and decryption of EEKs happens entirely on the KMS.
Importantly, the client requesting creation or decryption of an EEK never handles the EEK's encryption key.
To create a new EEK, the KMS generates a new random key, encrypts it with the specified key, and returns the EEK to the client.
To decrypt an EEK, the KMS checks that the user has access to the encryption key, uses it to decrypt the EEK, and returns the decrypted encryption key.
In the context of HDFS encryption, EEKs are <encrypted data encryption keys (EDEKs)>, where a <data encryption key (DEK)> is what is used to encrypt and decrypt file data.
Typically, the key store is configured to only allow end users access to the keys used to encrypt DEKs.
This means that EDEKs can be safely stored and handled by HDFS, since the HDFS user will not have access to unencrypted encryption keys.
* {Configuration}
A necessary prerequisite is an instance of the KMS, as well as a backing key store for the KMS.
See the {{{../../hadoop-kms/index.html}KMS documentation}} for more information.
Once a KMS has been set up and the NameNode and HDFS clients have been correctly configured, an admin can use the <<<hadoop key>>> and <<<hdfs crypto>>> command-line tools to create encryption keys and set up new encryption zones. Existing data can be encrypted by copying it into the new encryption zones using tools like distcp.
** Configuring the cluster KeyProvider
*** dfs.encryption.key.provider.uri
The KeyProvider to use when interacting with encryption keys used when reading and writing to an encryption zone.
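For example, if the KMS is reachable over HTTP on a host named <<<kms-host>>> at its default port, the provider URI could look like the following (the host name and port are placeholders for your own deployment):

-------------------------
<property>
  <name>dfs.encryption.key.provider.uri</name>
  <value>kms://http@kms-host:16000/kms</value>
</property>
-------------------------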
** Selecting an encryption algorithm and codec
*** hadoop.security.crypto.codec.classes.EXAMPLECIPHERSUITE
Prefix for a given crypto codec; contains a comma-separated list of implementation classes for that codec (e.g. EXAMPLECIPHERSUITE).
The first implementation will be used if available, others are fallbacks.
*** hadoop.security.crypto.codec.classes.aes.ctr.nopadding
Default: <<<org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec,org.apache.hadoop.crypto.JceAesCtrCryptoCodec>>>
Comma-separated list of crypto codec implementations for AES/CTR/NoPadding.
The first implementation will be used if available, others are fallbacks.
*** hadoop.security.crypto.cipher.suite
Default: <<<AES/CTR/NoPadding>>>
Cipher suite for crypto codec.
*** hadoop.security.crypto.jce.provider
Default: None
The JCE provider name used in CryptoCodec.
*** hadoop.security.crypto.buffer.size
Default: <<<8192>>>
The buffer size used by CryptoInputStream and CryptoOutputStream.
** Namenode configuration
*** dfs.namenode.list.encryption.zones.num.responses
Default: <<<100>>>
When listing encryption zones, the maximum number of zones that will be returned in a batch.
Fetching the list incrementally in batches improves namenode performance.
* {<<<crypto>>> command-line interface}
** {createZone}
Usage: <<<[-createZone -keyName <keyName> -path <path>]>>>
Create a new encryption zone.
*--+--+
<path> | The path of the encryption zone to create. It must be an empty directory.
*--+--+
<keyName> | Name of the key to use for the encryption zone.
*--+--+
** {listZones}
Usage: <<<[-listZones]>>>
List all encryption zones. Requires superuser permissions.
* {Example usage}
These instructions assume that you are running as the normal user or HDFS superuser as is appropriate.
Use <<<sudo>>> as needed for your environment.
-------------------------
# As the normal user, create a new encryption key
hadoop key create myKey
# As the super user, create a new empty directory and make it an encryption zone
hadoop fs -mkdir /zone
hdfs crypto -createZone -keyName myKey -path /zone
# chown it to the normal user
hadoop fs -chown myuser:myuser /zone
# As the normal user, put a file in, read it out
hadoop fs -put helloWorld /zone
hadoop fs -cat /zone/helloWorld
-------------------------
* {Distcp considerations}
** {Running as the superuser}
One common usecase for distcp is to replicate data between clusters for backup and disaster recovery purposes.
This is typically performed by the cluster administrator, who is an HDFS superuser.
To enable this same workflow when using HDFS encryption, we introduced a new virtual path prefix, <<</.reserved/raw/>>>, that gives superusers direct access to the underlying block data in the filesystem.
This allows superusers to distcp data without needing access to encryption keys, and also avoids the overhead of decrypting and re-encrypting data. It also means the source and destination data will be byte-for-byte identical, which would not be true if the data were being re-encrypted with a new EDEK.
When using <<</.reserved/raw>>> to distcp encrypted data, it's important to preserve extended attributes with the {{-px}} flag.
This is because encrypted file attributes (such as the EDEK) are exposed through extended attributes within <<</.reserved/raw>>>, and must be preserved to be able to decrypt the file.
This means that if the distcp is initiated at or above the encryption zone root, it will automatically create an encryption zone at the destination if it does not already exist.
However, it's still recommended that the admin first create identical encryption zones on the destination cluster to avoid any potential mishaps.
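For example, a superuser might replicate an entire encryption zone to a backup cluster as follows (the cluster addresses and zone path below are only illustrative):

-------------------------
hadoop distcp -px hdfs://sourceCluster:8020/.reserved/raw/zone hdfs://backupCluster:8020/.reserved/raw/zone
-------------------------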
** {Copying between encrypted and unencrypted locations}
By default, distcp compares checksums provided by the filesystem to verify that the data was successfully copied to the destination.
When copying between an unencrypted and encrypted location, the filesystem checksums will not match since the underlying block data is different.
In this case, specify the {{-skipcrccheck}} and {{-update}} distcp flags to avoid verifying checksums.
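For example, to copy an unencrypted directory into an encryption zone (the paths below are only illustrative):

-------------------------
hadoop distcp -update -skipcrccheck /datasets/logs /zone/logs
-------------------------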
* {Attack vectors}
** {Hardware access exploits}
These exploits assume that the attacker has gained physical access to hard drives from cluster machines, i.e. datanodes and namenodes.
[[1]] Access to swap files of processes containing data encryption keys.
* By itself, this does not expose cleartext, as it also requires access to encrypted block files.
* This can be mitigated by disabling swap, using encrypted swap, or using mlock to prevent keys from being swapped out.
[[1]] Access to encrypted block files.
* By itself, this does not expose cleartext, as it also requires access to DEKs.
** {Root access exploits}
These exploits assume that the attacker has gained root shell access to cluster machines, i.e. datanodes and namenodes.
Many of these exploits cannot be addressed in HDFS, since a malicious root user has access to the in-memory state of processes holding encryption keys and cleartext.
For these exploits, the only mitigation technique is carefully restricting and monitoring root shell access.
[[1]] Access to encrypted block files.
* By itself, this does not expose cleartext, as it also requires access to encryption keys.
[[1]] Dump memory of client processes to obtain DEKs, delegation tokens, cleartext.
* No mitigation.
[[1]] Recording network traffic to sniff encryption keys and encrypted data in transit.
* By itself, insufficient to read cleartext without the EDEK encryption key.
[[1]] Dump memory of datanode process to obtain encrypted block data.
* By itself, insufficient to read cleartext without the DEK.
[[1]] Dump memory of namenode process to obtain encrypted data encryption keys.
* By itself, insufficient to read cleartext without the EDEK's encryption key and encrypted block files.
** {HDFS admin exploits}
These exploits assume that the attacker has compromised HDFS, but does not have root or <<<hdfs>>> user shell access.
[[1]] Access to encrypted block files.
* By itself, insufficient to read cleartext without the EDEK and EDEK encryption key.
[[1]] Access to encryption zone and encrypted file metadata (including encrypted data encryption keys), via -fetchImage.
* By itself, insufficient to read cleartext without EDEK encryption keys.
** {Rogue user exploits}
A rogue user can collect keys of files they have access to, and use them later to decrypt the encrypted data of those files.
As the user had access to those files, they already had access to the file contents.
This can be mitigated through periodic key rolling policies.
View File
@ -1,304 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Distributed File System-${project.version} - ViewFs Guide
---
---
${maven.build.timestamp}
ViewFs Guide
%{toc|section=1|fromDepth=0}
* {Introduction}
The View File System (ViewFs) provides a way to manage multiple Hadoop file system namespaces (or namespace volumes).
It is particularly useful for clusters having multiple namenodes, and hence multiple namespaces, in {{{./Federation.html}HDFS Federation}}.
ViewFs is analogous to <client side mount tables> in some Unix/Linux systems.
ViewFs can be used to create personalized namespace views and also per-cluster common views.
This guide is presented in the context of Hadoop systems that have several clusters, each cluster may be federated into multiple namespaces.
It also describes how to use ViewFs in federated HDFS to provide a per-cluster global namespace so that applications can operate in a way similar to the pre-federation world.
* The Old World (Prior to Federation)
** Single Namenode Clusters
In the old world prior to {{{./Federation.html}HDFS Federation}}, a cluster has a single namenode which provides a single file system namespace for that cluster.
Suppose there are multiple clusters.
The file system namespaces of each cluster are completely independent and disjoint.
Furthermore, physical storage is NOT shared across clusters (i.e. the Datanodes are not shared across clusters.)
The <<<core-site.xml>>> of each cluster has a configuration property that sets the default file system to the namenode of that cluster:
+-----------------
<property>
<name>fs.default.name</name>
<value>hdfs://namenodeOfClusterX:port</value>
</property>
+-----------------
Such a configuration property allows one to use slash-relative names to resolve paths relative to the cluster namenode.
For example, the path <<</foo/bar>>> is referring to <<<hdfs://namenodeOfClusterX:port/foo/bar>>> using the above configuration.
This configuration property is set on each gateway on the clusters and also on key services of that cluster such as the JobTracker and Oozie.
** Pathnames Usage Patterns
Hence on Cluster X where the <<<core-site.xml>>> is set as above, the typical pathnames are
[[1]] <<</foo/bar>>>
* This is equivalent to <<<hdfs://namenodeOfClusterX:port/foo/bar>>> as before.
[[2]] <<<hdfs://namenodeOfClusterX:port/foo/bar>>>
* While this is a valid pathname, one is better off using <<</foo/bar>>> as it allows the application and its data to be transparently moved to another cluster when needed.
[[3]] <<<hdfs://namenodeOfClusterY:port/foo/bar>>>
* It is a URI for referring to a pathname on another cluster such as Cluster Y.
In particular, the command for copying files from cluster Y to Cluster Z looks like:
+-----------------
distcp hdfs://namenodeClusterY:port/pathSrc hdfs://namenodeClusterZ:port/pathDest
+-----------------
[[4]] <<<webhdfs://namenodeClusterX:http_port/foo/bar>>> and
<<<hftp://namenodeClusterX:http_port/foo/bar>>>
* These are file system URIs respectively for accessing files via the WebHDFS file system and the HFTP file system.
Note that WebHDFS and HFTP use the HTTP port of the namenode but not the RPC port.
[[5]] <<<http://namenodeClusterX:http_port/webhdfs/v1/foo/bar>>> and
<<<http://proxyClusterX:http_port/foo/bar>>>
* These are HTTP URLs respectively for accessing files via {{{./WebHDFS.html}WebHDFS REST API}} and HDFS proxy.
** Pathname Usage Best Practices
When one is within a cluster, it is recommended to use the pathname of type (1) above instead of a fully qualified URI like (2).
Fully qualified URIs are similar to addresses and do not allow the application to move along with its data.
* New World Federation and ViewFs
** How The Clusters Look
Suppose there are multiple clusters.
Each cluster has one or more namenodes.
Each namenode has its own namespace.
A namenode belongs to one and only one cluster.
The namenodes in the same cluster share the physical storage of that cluster.
The namespaces across clusters are independent as before.
Operations decide what is stored on each namenode within a cluster based on the storage needs.
For example, they may put all the user data (<<</user/\<username\>>>>) in one namenode, all the feed-data (<<</data>>>) in another namenode, all the projects (<<</projects>>>) in yet another namenode, etc.
** A Global Namespace Per Cluster Using ViewFs
In order to provide transparency with the old world, the ViewFs file system (i.e. client-side mount table) is used to give each cluster an independent cluster namespace view, which is similar to the namespace in the old world.
The client-side mount tables are like Unix mount tables, and they mount the new namespace volumes using the old naming convention.
The following figure shows a mount table mounting four namespace volumes <<</user>>>, <<</data>>>, <<</projects>>>, and <<</tmp>>>:
[./images/viewfs_TypicalMountTable.png]
ViewFs implements the Hadoop file system interface just like HDFS and the local file system.
It is a trivial file system in the sense that it only allows linking to other file systems.
Because ViewFs implements the Hadoop file system interface, it works transparently with Hadoop tools.
For example, all the shell commands work with ViewFs as with HDFS and local file system.
The mount points of a mount table are specified in the standard Hadoop configuration files.
In the configuration of each cluster, the default file system is set to the mount table for that cluster as shown below (compare it with the configuration in {{Single Namenode Clusters}}).
+-----------------
<property>
<name>fs.default.name</name>
<value>viewfs://clusterX</value>
</property>
+-----------------
The authority following the <<<viewfs://>>> scheme in the URI is the mount table name.
It is recommended that the mount table of a cluster be named after the cluster name.
Then Hadoop system will look for a mount table with the name "clusterX" in the Hadoop configuration files.
Operations arrange all gateways and service machines to contain the mount tables for ALL clusters such that, for each cluster, the default file system is set to the ViewFs mount table for that cluster as described above.
** Pathname Usage Patterns
Hence on Cluster X, where the <<<core-site.xml>>> is set to make the default fs to use the mount table of that cluster, the typical pathnames are
[[1]] <<</foo/bar>>>
* This is equivalent to <<<viewfs://clusterX/foo/bar>>>.
If such a pathname is used in the old non-federated world, then the transition to the federated world is transparent.
[[2]] <<<viewfs://clusterX/foo/bar>>>
* While this is a valid pathname, one is better off using <<</foo/bar>>> as it allows the application and its data to be transparently moved to another cluster when needed.
[[3]] <<<viewfs://clusterY/foo/bar>>>
* It is a URI for referring to a pathname on another cluster such as Cluster Y.
In particular, the command for copying files from cluster Y to Cluster Z looks like:
+-----------------
distcp viewfs://clusterY/pathSrc viewfs://clusterZ/pathDest
+-----------------
[[4]] <<<viewfs://clusterX-webhdfs/foo/bar>>> and
<<<viewfs://clusterX-hftp/foo/bar>>>
* These are URIs respectively for accessing files via the WebHDFS file system and the HFTP file system.
[[5]] <<<http://namenodeClusterX:http_port/webhdfs/v1/foo/bar>>> and
<<<http://proxyClusterX:http_port/foo/bar>>>
* These are HTTP URLs respectively for accessing files via {{{./WebHDFS.html}WebHDFS REST API}} and HDFS proxy.
Note that they are the same as before.
** Pathname Usage Best Practices
When one is within a cluster, it is recommended to use the pathname of type (1) above instead of a fully qualified URI like (2).
Further, applications should not rely on knowledge of the mount points by using a path like <<<hdfs://namenodeContainingUserDirs:port/joe/foo/bar>>> to refer to a file in a particular namenode.
One should use <<</user/joe/foo/bar>>> instead.
** Renaming Pathnames Across Namespaces
Recall that one cannot rename files or directories across namenodes or clusters in the old world.
The same is true in the new world but with an additional twist.
For example, in the old world one can perform the command below.
+-----------------
rename /user/joe/myStuff /data/foo/bar
+-----------------
This will NOT work in the new world if <<</user>>> and <<</data>>> are actually stored on different namenodes within a cluster.
** FAQ
[[1]] <<As I move from non-federated world to the federated world, I will have to keep track of namenodes for different volumes; how do I do that?>>
No, you won't.
See the examples above; you are either using a relative name and taking advantage of the default file system, or changing your path from <<<hdfs://namenodeClusterX/foo/bar>>> to <<<viewfs://clusterX/foo/bar>>>.
[[2]] <<What happens if Operations moves some files from one namenode to another namenode within a cluster?>>
Operations may move files from one namenode to another in order to deal with storage capacity issues.
They will do this in a way that avoids breaking applications.
Let's take some examples.
* Example 1:
<<</user>>> and <<</data>>> were on one namenode and later they need to be on separate namenodes to deal with capacity issues.
Indeed, operations would have created separate mount points for <<</user>>> and <<</data>>>.
Prior to the change the mounts for <<</user>>> and <<</data>>> would have pointed to the same namenode, say <<<namenodeContainingUserAndData>>>.
Operations will update the mount tables so that the mount points are changed to <<<namenodeContainingUser>>> and <<<namenodeContainingData>>>, respectively.
* Example 2:
All projects fit on one namenode, but later they need two or more namenodes. ViewFs allows mounts like <<</project/foo>>> and <<</project/bar>>>.
This allows mount tables to be updated to point to the corresponding namenode.
[[3]] <<Is the mount table in each>> <<<core-site.xml>>> <<or in a separate file of its own?>>
The plan is to keep the mount tables in separate files and have the <<<core-site.xml>>> {{{http://www.w3.org/2001/XInclude}xincluding}} it.
While one can keep these files on each machine locally, it is better to use HTTP to access it from a central location.
[[4]] <<Should the configuration have the mount table definitions for only one cluster or all clusters?>>
The configuration should have the mount definitions for all clusters since one needs to have access to data in other clusters such as with distcp.
[[5]] <<When is the mount table actually read given that Operations may change a mount table over time?>>
The mount table is read when the job is submitted to the cluster.
The <<<XInclude>>> in <<<core-site.xml>>> is expanded at job submission time.
This means that if the mount tables are changed, then the jobs need to be resubmitted.
Due to this reason, we want to implement merge-mount which will greatly reduce the need to change mount tables.
Further, we would like to read the mount tables via another mechanism that is initialized at job start time in the future.
[[6]] <<Will the JobTracker (or YARN's Resource Manager) itself use ViewFs?>>
No, it does not need to.
Neither does the NodeManager.
[[7]] <<Does ViewFs allow only mounts at the top level?>>
No; it is more general.
For example, one can mount <<</user/joe>>> and <<</user/jane>>>.
In this case, an internal read-only directory is created for <<</user>>> in the mount table.
All operations on <<</user>>> are valid except that <<</user>>> is read-only.
[[8]] <<An application works across the clusters and needs to persistently store file paths.
Which paths should it store?>>
You should store path names of the <<<viewfs://cluster/path>>> type, the same as the application uses when it runs.
This insulates you from movement of data within namenodes inside a cluster as long as operations do the moves in a transparent fashion.
It does not insulate you if data gets moved from one cluster to another; the older (pre-federation) world did not protect you from such data movements across clusters anyway.
[[9]] <<What about delegation tokens?>>
Delegation tokens for the cluster to which you are submitting the job (including all mounted volumes for that cluster's mount table), and for input and output paths to your map-reduce job (including all volumes mounted via mount tables for the specified input and output paths) are all handled automatically.
In addition, there is a way to add additional delegation tokens to the base cluster configuration for special circumstances.
* Appendix: A Mount Table Configuration Example
Generally, users do not have to define mount tables or the <<<core-site.xml>>> to use the mount table.
This is done by operations and the correct configuration is set on the right gateway machines as is done for <<<core-site.xml>>> today.
The mount tables can be described in <<<core-site.xml>>> but it is better to use indirection in <<<core-site.xml>>> to reference a separate configuration file, say <<<mountTable.xml>>>.
Add the following configuration element to <<<core-site.xml>>> for referencing <<<mountTable.xml>>>:
+-----------------
<configuration xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include href="mountTable.xml" />
</configuration>
+-----------------
In the file <<<mountTable.xml>>>, there is a definition of the mount table "ClusterX" for the hypothetical cluster that is a federation of the three namespace volumes managed by the three namenodes
[[1]] nn1-clusterx.example.com:8020,
[[2]] nn2-clusterx.example.com:8020, and
[[3]] nn3-clusterx.example.com:8020.
[]
Here <<</home>>> and <<</tmp>>> are in the namespace managed by namenode nn1-clusterx.example.com:8020,
and projects <<</foo>>> and <<</bar>>> are hosted on the other namenodes of the federated cluster.
The home directory base path is set to <<</home>>>
so that each user can access its home directory using the getHomeDirectory() method defined in
{{{../../api/org/apache/hadoop/fs/FileSystem.html}FileSystem}}/{{{../../api/org/apache/hadoop/fs/FileContext.html}FileContext}}.
+-----------------
<configuration>
<property>
<name>fs.viewfs.mounttable.ClusterX.homedir</name>
<value>/home</value>
</property>
<property>
<name>fs.viewfs.mounttable.ClusterX.link./home</name>
<value>hdfs://nn1-clusterx.example.com:8020/home</value>
</property>
<property>
<name>fs.viewfs.mounttable.ClusterX.link./tmp</name>
<value>hdfs://nn1-clusterx.example.com:8020/tmp</value>
</property>
<property>
<name>fs.viewfs.mounttable.ClusterX.link./projects/foo</name>
<value>hdfs://nn2-clusterx.example.com:8020/projects/foo</value>
</property>
<property>
<name>fs.viewfs.mounttable.ClusterX.link./projects/bar</name>
<value>hdfs://nn3-clusterx.example.com:8020/projects/bar</value>
</property>
</configuration>
+-----------------
File diff suppressed because it is too large
View File
@ -0,0 +1,160 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Archival Storage, SSD & Memory
==============================
* [Archival Storage, SSD & Memory](#Archival_Storage_SSD__Memory)
* [Introduction](#Introduction)
* [Storage Types and Storage Policies](#Storage_Types_and_Storage_Policies)
* [Storage Types: ARCHIVE, DISK, SSD and RAM\_DISK](#Storage_Types:_ARCHIVE_DISK_SSD_and_RAM_DISK)
* [Storage Policies: Hot, Warm, Cold, All\_SSD, One\_SSD and Lazy\_Persist](#Storage_Policies:_Hot_Warm_Cold_All_SSD_One_SSD_and_Lazy_Persist)
* [Storage Policy Resolution](#Storage_Policy_Resolution)
* [Configuration](#Configuration)
* [Mover - A New Data Migration Tool](#Mover_-_A_New_Data_Migration_Tool)
* [Storage Policy Commands](#Storage_Policy_Commands)
* [List Storage Policies](#List_Storage_Policies)
* [Set Storage Policy](#Set_Storage_Policy)
* [Get Storage Policy](#Get_Storage_Policy)
Introduction
------------
*Archival Storage* is a solution to decouple growing storage capacity from compute capacity. Nodes with higher density and less expensive storage with low compute power are becoming available and can be used as cold storage in the clusters. Based on policy the data from hot can be moved to the cold. Adding more nodes to the cold storage can grow the storage independent of the compute capacity in the cluster.
The frameworks provided by Heterogeneous Storage and Archival Storage generalizes the HDFS architecture to include other kinds of storage media including *SSD* and *memory*. Users may choose to store their data in SSD or memory for a better performance.
Storage Types and Storage Policies
----------------------------------
### Storage Types: ARCHIVE, DISK, SSD and RAM\_DISK
The first phase of [Heterogeneous Storage (HDFS-2832)](https://issues.apache.org/jira/browse/HDFS-2832) changed the datanode storage model from a single storage, which may correspond to multiple physical storage media, to a collection of storages, each corresponding to a physical storage medium. It also added the notion of storage types, DISK and SSD, where DISK is the default storage type.
A new storage type *ARCHIVE*, which has high storage density (petabyte of storage) but little compute power, is added for supporting archival storage.
Another new storage type *RAM\_DISK* is added for supporting writing single replica files in memory.
### Storage Policies: Hot, Warm, Cold, All\_SSD, One\_SSD and Lazy\_Persist
A new concept of storage policies is introduced in order to allow files to be stored in different storage types according to the storage policy.
We have the following storage policies:
* **Hot** - for both storage and compute. The data that is popular and still being used for processing will stay in this policy. When a block is hot, all replicas are stored in DISK.
* **Cold** - only for storage with limited compute. The data that is no longer being used, or data that needs to be archived is moved from hot storage to cold storage. When a block is cold, all replicas are stored in ARCHIVE.
* **Warm** - partially hot and partially cold. When a block is warm, some of its replicas are stored in DISK and the remaining replicas are stored in ARCHIVE.
* **All\_SSD** - for storing all replicas in SSD.
* **One\_SSD** - for storing one of the replicas in SSD. The remaining replicas are stored in DISK.
* **Lazy\_Persist** - for writing blocks with single replica in memory. The replica is first written in RAM\_DISK and then it is lazily persisted in DISK.
More formally, a storage policy consists of the following fields:
1. Policy ID
2. Policy name
3. A list of storage types for block placement
4. A list of fallback storage types for file creation
5. A list of fallback storage types for replication
When there is enough space, block replicas are stored according to the storage type list specified in \#3. When some of the storage types in list \#3 are running out of space, the fallback storage type lists specified in \#4 and \#5 are used to replace the out-of-space storage types for file creation and replication, respectively.
The following is a typical storage policy table.
| **Policy ID** | **Policy Name** | **Block Placement (n replicas)** | **Fallback storages for creation** | **Fallback storages for replication** |
|:---- |:---- |:---- |:---- |:---- |
| 15 | Lazy\_Persist | RAM\_DISK: 1, DISK: *n*-1 | DISK | DISK |
| 12 | All\_SSD | SSD: *n* | DISK | DISK |
| 10 | One\_SSD | SSD: 1, DISK: *n*-1 | SSD, DISK | SSD, DISK |
| 7 | Hot (default) | DISK: *n* | \<none\> | ARCHIVE |
| 5 | Warm | DISK: 1, ARCHIVE: *n*-1 | ARCHIVE, DISK | ARCHIVE, DISK |
| 2 | Cold | ARCHIVE: *n* | \<none\> | \<none\> |
Note that the Lazy\_Persist policy is useful only for blocks with a single replica. For blocks with more than one replica, all the replicas will be written to DISK since writing only one of the replicas to RAM\_DISK does not improve the overall performance.
### Storage Policy Resolution
When a file or directory is created, its storage policy is *unspecified*. The storage policy can be specified using the "[`storagepolicies -setStoragePolicy`](#Set_Storage_Policy)" command. The effective storage policy of a file or directory is resolved by the following rules.
1. If the file or directory is specified with a storage policy, return it.
2. For an unspecified file or directory, if it is the root directory, return the *default storage policy*. Otherwise, return its parent's effective storage policy.
The effective storage policy can be retrieved by the "[`storagepolicies -getStoragePolicy`](#Get_Storage_Policy)" command.
### Configuration
* **dfs.storage.policy.enabled** - for enabling/disabling the storage policy feature. The default value is `true`.
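As a sketch, the property could be set explicitly in `hdfs-site.xml` as follows (shown only for illustration, since the default value already enables the feature):

```xml
<property>
  <name>dfs.storage.policy.enabled</name>
  <value>true</value>
</property>
```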
Mover - A New Data Migration Tool
---------------------------------
A new data migration tool is added for archiving data. The tool is similar to Balancer. It periodically scans the files in HDFS to check if the block placement satisfies the storage policy. For the blocks violating the storage policy, it moves the replicas to a different storage type in order to fulfill the storage policy requirement.
* Command:
hdfs mover [-p <files/dirs> | -f <local file name>]
* Arguments:
| | |
|:---- |:---- |
| `-p <files/dirs>` | Specify a space separated list of HDFS files/dirs to migrate. |
| `-f <local file>` | Specify a local file containing a list of HDFS files/dirs to migrate. |
Note that, when both -p and -f options are omitted, the default path is the root directory.
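For example, after changing the storage policy of a directory, an administrator might migrate only that directory's blocks (the path is illustrative):

```bash
# Move replicas under /archive so they satisfy the directory's storage policy
hdfs mover -p /archive
```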
Storage Policy Commands
-----------------------
### List Storage Policies
List out all the storage policies.
* Command:
hdfs storagepolicies -listPolicies
* Arguments: none.
### Set Storage Policy
Set a storage policy to a file or a directory.
* Command:
hdfs storagepolicies -setStoragePolicy -path <path> -policy <policy>
* Arguments:
| | |
|:---- |:---- |
| `-path <path>` | The path referring to either a directory or a file. |
| `-policy <policy>` | The name of the storage policy. |
### Get Storage Policy
Get the storage policy of a file or a directory.
* Command:
hdfs storagepolicies -getStoragePolicy -path <path>
* Arguments:
| | |
|:---- |:---- |
| `-path <path>` | The path referring to either a directory or a file. |
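As a sketch, a typical sequence using the commands above, with an illustrative path, might be:

```bash
# Mark a directory as cold, verify the setting, then migrate its blocks
hdfs storagepolicies -setStoragePolicy -path /archive/2014 -policy Cold
hdfs storagepolicies -getStoragePolicy -path /archive/2014
hdfs mover -p /archive/2014
```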
View File
@ -0,0 +1,268 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Centralized Cache Management in HDFS
====================================
* [Overview](#Overview)
* [Use Cases](#Use_Cases)
* [Architecture](#Architecture)
* [Concepts](#Concepts)
* [Cache directive](#Cache_directive)
* [Cache pool](#Cache_pool)
* [cacheadmin command-line interface](#cacheadmin_command-line_interface)
* [Cache directive commands](#Cache_directive_commands)
* [addDirective](#addDirective)
* [removeDirective](#removeDirective)
* [removeDirectives](#removeDirectives)
* [listDirectives](#listDirectives)
* [Cache pool commands](#Cache_pool_commands)
* [addPool](#addPool)
* [modifyPool](#modifyPool)
* [removePool](#removePool)
* [listPools](#listPools)
* [help](#help)
* [Configuration](#Configuration)
* [Native Libraries](#Native_Libraries)
* [Configuration Properties](#Configuration_Properties)
* [Required](#Required)
* [Optional](#Optional)
* [OS Limits](#OS_Limits)
Overview
--------
*Centralized cache management* in HDFS is an explicit caching mechanism that allows users to specify *paths* to be cached by HDFS. The NameNode will communicate with DataNodes that have the desired blocks on disk, and instruct them to cache the blocks in off-heap caches.
Centralized cache management in HDFS has many significant advantages.
1. Explicit pinning prevents frequently used data from being evicted from memory. This is particularly important when the size of the working set exceeds the size of main memory, which is common for many HDFS workloads.
2. Because DataNode caches are managed by the NameNode, applications can query the set of cached block locations when making task placement decisions. Co-locating a task with a cached block replica improves read performance.
3. When a block has been cached by a DataNode, clients can use a new, more-efficient, zero-copy read API. Since checksum verification of cached data is done once by the DataNode, clients can incur essentially zero overhead when using this new API.
4. Centralized caching can improve overall cluster memory utilization. When relying on the OS buffer cache at each DataNode, repeated reads of a block will result in all *n* replicas of the block being pulled into buffer cache. With centralized cache management, a user can explicitly pin only *m* of the *n* replicas, saving *n-m* memory.
Use Cases
---------
Centralized cache management is useful for files that are accessed repeatedly. For example, a small *fact table* in Hive which is often used for joins is a good candidate for caching. On the other hand, caching the input of a *one year reporting query* is probably less useful, since the historical data might only be read once.
Centralized cache management is also useful for mixed workloads with performance SLAs. Caching the working set of a high-priority workload ensures that it does not contend for disk I/O with a low-priority workload.
Architecture
------------
![Caching Architecture](images/caching.png)
In this architecture, the NameNode is responsible for coordinating all the DataNode off-heap caches in the cluster. The NameNode periodically receives a *cache report* from each DataNode which describes all the blocks cached on a given DN. The NameNode manages DataNode caches by piggybacking cache and uncache commands on the DataNode heartbeat.
The NameNode queries its set of *cache directives* to determine which paths should be cached. Cache directives are persistently stored in the fsimage and edit log, and can be added, removed, and modified via Java and command-line APIs. The NameNode also stores a set of *cache pools*, which are administrative entities used to group cache directives together for resource management and enforcing permissions.
The NameNode periodically rescans the namespace and active cache directives to determine which blocks need to be cached or uncached and assign caching work to DataNodes. Rescans can also be triggered by user actions like adding or removing a cache directive or removing a cache pool.
We do not currently cache blocks which are under construction, corrupt, or otherwise incomplete. If a cache directive covers a symlink, the symlink target is not cached.
Caching is currently done on the file or directory-level. Block and sub-block caching is an item of future work.
Concepts
--------
### Cache directive
A *cache directive* defines a path that should be cached. Paths can be either directories or files. Directories are cached non-recursively, meaning only files in the first-level listing of the directory.
Directives also specify additional parameters, such as the cache replication factor and expiration time. The replication factor specifies the number of block replicas to cache. If multiple cache directives refer to the same file, the maximum cache replication factor is applied.
The expiration time is specified on the command line as a *time-to-live (TTL)*, a relative expiration time in the future. After a cache directive expires, it is no longer considered by the NameNode when making caching decisions.
### Cache pool
A *cache pool* is an administrative entity used to manage groups of cache directives. Cache pools have UNIX-like *permissions*, which restrict which users and groups have access to the pool. Write permissions allow users to add and remove cache directives to the pool. Read permissions allow users to list the cache directives in a pool, as well as additional metadata. Execute permissions are unused.
Cache pools are also used for resource management. Pools can enforce a maximum *limit*, which restricts the number of bytes that can be cached in aggregate by directives in the pool. Normally, the sum of the pool limits will approximately equal the amount of aggregate memory reserved for HDFS caching on the cluster. Cache pools also track a number of statistics to help cluster users determine what is and should be cached.
Pools also can enforce a maximum time-to-live. This restricts the maximum expiration time of directives being added to the pool.
`cacheadmin` command-line interface
-----------------------------------
On the command-line, administrators and users can interact with cache pools and directives via the `hdfs cacheadmin` subcommand.
Cache directives are identified by a unique, non-repeating 64-bit integer ID. IDs will not be reused even if a cache directive is later removed.
Cache pools are identified by a unique string name.
### Cache directive commands
#### addDirective
Usage: `hdfs cacheadmin -addDirective -path <path> -pool <pool-name> [-force] [-replication <replication>] [-ttl <time-to-live>]`
Add a new cache directive.
| | |
|:---- |:---- |
| \<path\> | A path to cache. The path can be a directory or a file. |
| \<pool-name\> | The pool to which the directive will be added. You must have write permission on the cache pool in order to add new directives. |
| -force | Skips checking of cache pool resource limits. |
| \<replication\> | The cache replication factor to use. Defaults to 1. |
| \<time-to-live\> | How long the directive is valid. Can be specified in minutes, hours, and days, e.g. 30m, 4h, 2d. Valid units are [smhd]. "never" indicates a directive that never expires. If unspecified, the directive never expires. |
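For example, to cache a frequently joined Hive fact table with two cached replicas for a week (the pool and path names are hypothetical):

```bash
hdfs cacheadmin -addDirective -path /user/hive/warehouse/fact -pool hive-pool -replication 2 -ttl 7d
```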
#### removeDirective
Usage: `hdfs cacheadmin -removeDirective <id> `
Remove a cache directive.
| | |
|:---- |:---- |
| \<id\> | The id of the cache directive to remove. You must have write permission on the pool of the directive in order to remove it. To see a list of cachedirective IDs, use the -listDirectives command. |
#### removeDirectives
Usage: `hdfs cacheadmin -removeDirectives <path> `
Remove every cache directive with the specified path.
| | |
|:---- |:---- |
| \<path\> | The path of the cache directives to remove. You must have write permission on the pool of the directive in order to remove it. To see a list of cache directives, use the -listDirectives command. |
#### listDirectives
Usage: `hdfs cacheadmin -listDirectives [-stats] [-path <path>] [-pool <pool>]`
List cache directives.
| | |
|:---- |:---- |
| \<path\> | List only cache directives with this path. Note that if there is a cache directive for *path* in a cache pool that we don't have read access for, it will not be listed. |
| \<pool\> | List only path cache directives in that pool. |
| -stats | List path-based cache directive statistics. |
### Cache pool commands
#### addPool
Usage: `hdfs cacheadmin -addPool <name> [-owner <owner>] [-group <group>] [-mode <mode>] [-limit <limit>] [-maxTtl <maxTtl>]`
Add a new cache pool.
| | |
|:---- |:---- |
| \<name\> | Name of the new pool. |
| \<owner\> | Username of the owner of the pool. Defaults to the current user. |
| \<group\> | Group of the pool. Defaults to the primary group name of the current user. |
| \<mode\> | UNIX-style permissions for the pool. Permissions are specified in octal, e.g. 0755. By default, this is set to 0755. |
| \<limit\> | The maximum number of bytes that can be cached by directives in this pool, in aggregate. By default, no limit is set. |
| \<maxTtl\> | The maximum allowed time-to-live for directives being added to the pool. This can be specified in seconds, minutes, hours, and days, e.g. 120s, 30m, 4h, 2d. Valid units are [smhd]. By default, no maximum is set. A value of "never" specifies that there is no limit. |
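For example, to create a pool capped at roughly 10 GB of aggregate cached bytes (the pool name, owner, group, and limit are hypothetical):

```bash
hdfs cacheadmin -addPool hive-pool -owner hive -group hadoop -mode 0755 -limit 10737418240 -maxTtl 30d
```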
#### modifyPool
Usage: `hdfs cacheadmin -modifyPool <name> [-owner <owner>] [-group <group>] [-mode <mode>] [-limit <limit>] [-maxTtl <maxTtl>]`
Modifies the metadata of an existing cache pool.
| | |
|:---- |:---- |
| \<name\> | Name of the pool to modify. |
| \<owner\> | Username of the owner of the pool. |
| \<group\> | Groupname of the group of the pool. |
| \<mode\> | Unix-style permissions of the pool in octal. |
| \<limit\> | Maximum number of bytes that can be cached by this pool. |
| \<maxTtl\> | The maximum allowed time-to-live for directives being added to the pool. |
#### removePool
Usage: `hdfs cacheadmin -removePool <name> `
Remove a cache pool. This also uncaches paths associated with the pool.
| | |
|:---- |:---- |
| \<name\> | Name of the cache pool to remove. |
#### listPools
Usage: `hdfs cacheadmin -listPools [-stats] [<name>]`
Display information about one or more cache pools, e.g. name, owner, group, permissions, etc.
| | |
|:---- |:---- |
| -stats | Display additional cache pool statistics. |
| \<name\> | If specified, list only the named cache pool. |
#### help
Usage: `hdfs cacheadmin -help <command-name> `
Get detailed help about a command.
| | |
|:---- |:---- |
| \<command-name\> | The command for which to get detailed help. If no command is specified, print detailed help for all commands. |
Configuration
-------------
### Native Libraries
In order to lock block files into memory, the DataNode relies on native JNI code found in `libhadoop.so` or `hadoop.dll` on Windows. Be sure to [enable JNI](../hadoop-common/NativeLibraries.html) if you are using HDFS centralized cache management.
### Configuration Properties
#### Required
Be sure to configure the following:
* dfs.datanode.max.locked.memory
This determines the maximum amount of memory a DataNode will use for caching. On Unix-like systems, the "locked-in-memory size" ulimit (`ulimit -l`) of the DataNode user also needs to be increased to match this parameter (see below section on [OS Limits](#OS_Limits)). When setting this value, please remember that you will need space in memory for other things as well, such as the DataNode and application JVM heaps and the operating system page cache.
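A minimal sketch of the corresponding `hdfs-site.xml` entry, assuming roughly 4 GB is reserved for caching on each DataNode (the value is illustrative and must be given in bytes):

```xml
<property>
  <name>dfs.datanode.max.locked.memory</name>
  <value>4294967296</value>
</property>
```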
#### Optional
The following properties are not required, but may be specified for tuning:
* dfs.namenode.path.based.cache.refresh.interval.ms
The NameNode will use this as the number of milliseconds between subsequent path cache rescans. Each rescan calculates which blocks should be cached and assigns them to DataNodes that contain a replica.
By default, this parameter is set to 300000, which is five minutes.
* dfs.datanode.fsdatasetcache.max.threads.per.volume
The DataNode will use this as the maximum number of threads per volume to use for caching new data.
By default, this parameter is set to 4.
* dfs.cachereport.intervalMsec
The DataNode will use this as the amount of milliseconds between sending a full report of its cache state to the NameNode.
By default, this parameter is set to 10000, which is 10 seconds.
* dfs.namenode.path.based.cache.block.map.allocation.percent
The percentage of the Java heap which we will allocate to the cached blocks map. The cached blocks map is a hash map which uses chained hashing. Smaller maps may be accessed more slowly if the number of cached blocks is large; larger maps will consume more memory. The default is 0.25 percent.
### OS Limits
If you get the error "Cannot start datanode because the configured max locked memory size... is more than the datanode's available RLIMIT\_MEMLOCK ulimit," that means that the operating system is imposing a lower limit on the amount of memory that you can lock than what you have configured. To fix this, you must adjust the ulimit -l value that the DataNode runs with. Usually, this value is configured in `/etc/security/limits.conf`. However, it will vary depending on what operating system and distribution you are using.
You will know that you have correctly configured this value when you can run `ulimit -l` from the shell and get back either a higher value than what you have configured with `dfs.datanode.max.locked.memory`, or the string "unlimited," indicating that there is no limit. Note that it's typical for `ulimit -l` to output the memory lock limit in KB, but dfs.datanode.max.locked.memory must be specified in bytes.
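For example, assuming the DataNode runs as the `hdfs` user, an entry along the following lines in `/etc/security/limits.conf` removes the lock limit for that user (the exact syntax and location can vary between distributions):

```
# domain  type  item     value
hdfs      -     memlock  unlimited
```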
This information does not apply to deployments on Windows. Windows has no direct equivalent of `ulimit -l`.
View File
@ -0,0 +1,98 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Extended Attributes in HDFS
===========================
* [Overview](#Overview)
* [HDFS extended attributes](#HDFS_extended_attributes)
* [Namespaces and Permissions](#Namespaces_and_Permissions)
* [Interacting with extended attributes](#Interacting_with_extended_attributes)
* [getfattr](#getfattr)
* [setfattr](#setfattr)
* [Configuration options](#Configuration_options)
Overview
--------
*Extended attributes* (abbreviated as *xattrs*) are a filesystem feature that allow user applications to associate additional metadata with a file or directory. Unlike system-level inode metadata such as file permissions or modification time, extended attributes are not interpreted by the system and are instead used by applications to store additional information about an inode. Extended attributes could be used, for instance, to specify the character encoding of a plain-text document.
### HDFS extended attributes
Extended attributes in HDFS are modeled after extended attributes in Linux (see the Linux manpage for [attr(5)](http://www.bestbits.at/acl/man/man5/attr.txt) and [related documentation](http://www.bestbits.at/acl/)). An extended attribute is a *name-value pair*, with a string name and binary value. Xattr names must also be prefixed with a *namespace*. For example, an xattr named *myXattr* in the *user* namespace would be specified as **user.myXattr**. Multiple xattrs can be associated with a single inode.
### Namespaces and Permissions
In HDFS, there are five valid namespaces: `user`, `trusted`, `system`, `security`, and `raw`. Each of these namespaces has different access restrictions.
The `user` namespace is the namespace that will commonly be used by client applications. Access to extended attributes in the user namespace is controlled by the corresponding file permissions.
The `trusted` namespace is available only to HDFS superusers.
The `system` namespace is reserved for internal HDFS use. This namespace is not accessible through userspace methods, and is reserved for implementing internal HDFS features.
The `security` namespace is reserved for internal HDFS use. This namespace is generally not accessible through userspace methods. One particular use of `security` is the `security.hdfs.unreadable.by.superuser` extended attribute. This xattr can only be set on files, and it will prevent the superuser from reading the file's contents. The superuser can still read and modify file metadata, such as the owner, permissions, etc. This xattr can be set and accessed by any user, assuming normal filesystem permissions. This xattr is also write-once, and cannot be removed once set. This xattr does not allow a value to be set.
The `raw` namespace is reserved for internal system attributes that sometimes need to be exposed. Like `system` namespace attributes they are not visible to the user except when `getXAttr`/`getXAttrs` is called on a file or directory in the `/.reserved/raw` HDFS directory hierarchy. These attributes can only be accessed by the superuser. An example of where `raw` namespace extended attributes are used is the `distcp` utility. Encryption zone meta data is stored in `raw.*` extended attributes, so as long as the administrator uses `/.reserved/raw` pathnames in source and target, the encrypted files in the encryption zones are transparently copied.
Interacting with extended attributes
------------------------------------
The Hadoop shell has support for interacting with extended attributes via `hadoop fs -getfattr` and `hadoop fs -setfattr`. These commands are styled after the Linux [getfattr(1)](http://www.bestbits.at/acl/man/man1/getfattr.txt) and [setfattr(1)](http://www.bestbits.at/acl/man/man1/setfattr.txt) commands.
### getfattr
`hadoop fs -getfattr [-R] -n name | -d [-e en] <path>`
Displays the extended attribute names and values (if any) for a file or directory.
| | |
|:---- |:---- |
| -R | Recursively list the attributes for all files and directories. |
| -n name | Dump the named extended attribute value. |
| -d | Dump all extended attribute values associated with pathname. |
| -e \<encoding\> | Encode values after retrieving them. Valid encodings are "text", "hex", and "base64". Values encoded as text strings are enclosed in double quotes ("), and values encoded as hexadecimal and base64 are prefixed with 0x and 0s, respectively. |
| \<path\> | The file or directory. |
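For example, to dump all extended attributes of a file (the path is illustrative):

```bash
hadoop fs -getfattr -d /user/alice/document.txt
```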
### setfattr
`hadoop fs -setfattr -n name [-v value] | -x name <path>`
Sets an extended attribute name and value for a file or directory.
| | |
|:---- |:---- |
| -n name | The extended attribute name. |
| -v value | The extended attribute value. There are three different encoding methods for the value. If the argument is enclosed in double quotes, then the value is the string inside the quotes. If the argument is prefixed with 0x or 0X, then it is taken as a hexadecimal number. If the argument begins with 0s or 0S, then it is taken as a base64 encoding. |
| -x name | Remove the extended attribute. |
| \<path\> | The file or directory. |
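For example, to record a character encoding on a file and later remove it again (the attribute name and path are illustrative):

```bash
hadoop fs -setfattr -n user.encoding -v "utf-8" /user/alice/document.txt
hadoop fs -setfattr -x user.encoding /user/alice/document.txt
```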
Configuration options
---------------------
HDFS supports extended attributes out of the box, without additional configuration. Administrators could potentially be interested in the options limiting the number of xattrs per inode and the size of xattrs, since xattrs increase the on-disk and in-memory space consumption of an inode.
* `dfs.namenode.xattrs.enabled`
Whether support for extended attributes is enabled on the NameNode. By default, extended attributes are enabled.
* `dfs.namenode.fs-limits.max-xattrs-per-inode`
The maximum number of extended attributes per inode. By default, this limit is 32.
* `dfs.namenode.fs-limits.max-xattr-size`
The maximum combined size of the name and value of an extended attribute in bytes. By default, this limit is 16384 bytes.
View File
@ -0,0 +1,254 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Fault Injection Framework and Development Guide
===============================================
* [Fault Injection Framework and Development Guide](#Fault_Injection_Framework_and_Development_Guide)
* [Introduction](#Introduction)
* [Assumptions](#Assumptions)
* [Architecture of the Fault Injection Framework](#Architecture_of_the_Fault_Injection_Framework)
* [Configuration Management](#Configuration_Management)
* [Probability Model](#Probability_Model)
* [Fault Injection Mechanism: AOP and AspectJ](#Fault_Injection_Mechanism:_AOP_and_AspectJ)
* [Existing Join Points](#Existing_Join_Points)
* [Aspect Example](#Aspect_Example)
* [Fault Naming Convention and Namespaces](#Fault_Naming_Convention_and_Namespaces)
* [Development Tools](#Development_Tools)
* [Putting It All Together](#Putting_It_All_Together)
* [How to Use the Fault Injection Framework](#How_to_Use_the_Fault_Injection_Framework)
* [Additional Information and Contacts](#Additional_Information_and_Contacts)
Introduction
------------
This guide provides an overview of the Hadoop Fault Injection (FI) framework for those who will be developing their own faults (aspects).
The idea of fault injection is fairly simple: it is an infusion of errors and exceptions into an application's logic to achieve a higher coverage and fault tolerance of the system. Different implementations of this idea are available today. Hadoop's FI framework is built on top of Aspect Oriented Programming (AOP), as implemented by the AspectJ toolkit.
Assumptions
-----------
The current implementation of the FI framework assumes that the faults it will be emulating are of a non-deterministic nature. That is, the moment at which a fault happens isn't known in advance and is decided by a coin flip.
Architecture of the Fault Injection Framework
---------------------------------------------
Components layout
### Configuration Management
This piece of the FI framework allows you to set expectations for faults to happen. The settings can be applied either statically (in advance) or in runtime. The desired level of faults in the framework can be configured two ways:
* editing the src/aop/fi-site.xml configuration file (a sketch follows this list). This file is similar to other Hadoop configuration files
* setting JVM system properties through VM startup parameters or in the build.properties file
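As a sketch, a statically configured fault level in `src/aop/fi-site.xml` could look like the following (the fault name follows the naming convention described below; the probability value is illustrative):

```xml
<property>
  <name>fi.hdfs.datanode.BlockReceiver</name>
  <value>0.12</value>
  <description>Emulate a fault in BlockReceiver with 12% probability</description>
</property>
```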
### Probability Model
This is fundamentally a coin flipper. The methods of this class draw a random number between 0.0 and 1.0 and then check whether it falls between 0.0 and the configured level for the fault in question. If that condition is true, the fault occurs.
Thus, to guarantee that a fault happens, set its level to 1.0. To completely prevent a fault from happening, set its probability level to 0.0.
Note: The default probability level is set to 0 (zero) unless the level is changed explicitly through the configuration file or in the runtime. The name of the default level's configuration parameter is fi.\*
### Fault Injection Mechanism: AOP and AspectJ
The foundation of Hadoop's FI framework includes a cross-cutting concept implemented by AspectJ. The following basic terms are important to remember:
* A cross-cutting concept (aspect) is behavior, and often data, that
is used across the scope of a piece of software
* In AOP, the aspects provide a mechanism by which a cross-cutting concern
can be specified in a modular way
* Advice is the code that is executed when an aspect is invoked
* Join point (or pointcut) is a specific point within the application that may or may not invoke some advice
### Existing Join Points
The following readily available join points are provided by AspectJ:
* Join when a method is called
* Join during a method's execution
* Join when a constructor is invoked
* Join during a constructor's execution
* Join during aspect advice execution
* Join before an object is initialized
* Join during object initialization
* Join during static initializer execution
* Join when a class's field is referenced
* Join when a class's field is assigned
* Join when a handler is executed
Aspect Example
--------------
```java
package org.apache.hadoop.hdfs.server.datanode;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fi.ProbabilityModel;
import org.apache.hadoop.hdfs.server.datanode.DataNode;
import org.apache.hadoop.util.DiskChecker.*;
import java.io.IOException;
import java.io.OutputStream;
import java.io.DataOutputStream;
/**
 * This aspect takes care of faults injected into the
 * datanode.BlockReceiver class.
 */
public aspect BlockReceiverAspects {
  public static final Log LOG = LogFactory.getLog(BlockReceiverAspects.class);
  public static final String BLOCK_RECEIVER_FAULT = "hdfs.datanode.BlockReceiver";

  pointcut callReceivePacket() : call (* OutputStream.write(..))
    && withincode (* BlockReceiver.receivePacket(..))
    // to further limit the application of this aspect a very narrow 'target' can be used as follows
    // && target(DataOutputStream)
    && !within(BlockReceiverAspects+);

  before () throws IOException : callReceivePacket () {
    if (ProbabilityModel.injectCriteria(BLOCK_RECEIVER_FAULT)) {
      LOG.info("Before the injection point");
      Thread.dumpStack();
      throw new DiskOutOfSpaceException("FI: injected fault point at " +
          thisJoinPoint.getStaticPart().getSourceLocation());
    }
  }
}
```
The aspect has two main parts:
* The join point pointcut callReceivePacket(), which serves as an
  identification mark of a specific point (in control and/or data
  flow) in the life of an application.
* A call to the advice - before () throws IOException :
  callReceivePacket() - will be injected (see Putting It All
  Together) before that specific spot of the application's code.
The pointcut identifies an invocation of the java.io.OutputStream write() method with any number of parameters and any return type. This invocation has to take place within the body of the method receivePacket() from the class BlockReceiver. The method can have any parameters and any return type. Possible invocations of the write() method happening anywhere within the aspect BlockReceiverAspects or its subclasses will be ignored.
Note 1: This short example doesn't illustrate the fact that you can have more than a single injection point per class. In such a case the names of the faults have to be different if a developer wants to trigger them separately.
Note 2: After the injection step (see Putting It All Together) you can verify that the faults were properly injected by searching for ajc keywords in a disassembled class file.
Fault Naming Convention and Namespaces
--------------------------------------
For the sake of a unified naming convention, the following two types of names are recommended when developing new aspects:
* Activity specific notation (when we don't care about a particular
location of a fault's happening). In this case the name of the
fault is rather abstract: fi.hdfs.DiskError
* Location specific notation. Here, the fault's name is mnemonic as
in: fi.hdfs.datanode.BlockReceiver[optional location details]
Development Tools
-----------------
* The Eclipse AspectJ Development Toolkit may help you when developing aspects
* IntelliJ IDEA provides AspectJ weaver and Spring-AOP plugins
Putting It All Together
-----------------------
Faults (aspects) have to be injected (or woven) together before they can be used. Follow these instructions:
* To weave aspects in place use:
% ant injectfaults
* If you misidentified the join point of your aspect you will see a warning (similar to the one shown here) when the 'injectfaults' target is completed:
[iajc] warning at
src/test/aop/org/apache/hadoop/hdfs/server/datanode/ \
BlockReceiverAspects.aj:44::0
advice defined in org.apache.hadoop.hdfs.server.datanode.BlockReceiverAspects
has not been applied [Xlint:adviceDidNotMatch]
* It isn't an error, so the build will report a successful result. To prepare a dev.jar file with all your faults woven in place (HDFS-475 pending) use:
% ant jar-fault-inject
* To create test jars use:
% ant jar-test-fault-inject
* To run HDFS tests with faults injected use:
% ant run-test-hdfs-fault-inject
### How to Use the Fault Injection Framework
Faults can be triggered as follows:
* During runtime:
% ant run-test-hdfs -Dfi.hdfs.datanode.BlockReceiver=0.12
To set a certain level, for example 25%, of all injected faults use:
% ant run-test-hdfs-fault-inject -Dfi.*=0.25
* From a program:
```java
package org.apache.hadoop.fs;

import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class DemoFiTest {
  public static final String BLOCK_RECEIVER_FAULT = "hdfs.datanode.BlockReceiver";

  @Before
  public void setUp() {
    // Setting up the test's environment as required
  }

  @Test
  public void testFI() {
    // It triggers the fault, assuming that there's one called 'hdfs.datanode.BlockReceiver'
    System.setProperty("fi." + BLOCK_RECEIVER_FAULT, "0.12");
    //
    // The main logic of your tests goes here
    //
    // Now set the level back to 0 (zero) to prevent this fault from happening again
    System.setProperty("fi." + BLOCK_RECEIVER_FAULT, "0.0");
    // or delete its trigger completely
    System.getProperties().remove("fi." + BLOCK_RECEIVER_FAULT);
  }

  @After
  public void tearDown() {
    // Cleaning up the test environment
  }
}
```
As you can see above, these two approaches do the same thing: they set the probability level of `hdfs.datanode.BlockReceiver` to 12%. The difference, however, is that the programmatic approach provides more flexibility and allows you to turn a fault off when a test no longer needs it.
Additional Information and Contacts
-----------------------------------
These two sources of information are particularly interesting and worth reading:
* <http://www.eclipse.org/aspectj/doc/next/devguide/>
* AspectJ Cookbook (ISBN-13: 978-0-596-00654-9)
If you have additional comments or questions for the author check [HDFS-435](https://issues.apache.org/jira/browse/HDFS-435).


@ -0,0 +1,254 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
HDFS Federation
===============
* [HDFS Federation](#HDFS_Federation)
* [Background](#Background)
* [Multiple Namenodes/Namespaces](#Multiple_NamenodesNamespaces)
* [Key Benefits](#Key_Benefits)
* [Federation Configuration](#Federation_Configuration)
* [Configuration:](#Configuration:)
* [Formatting Namenodes](#Formatting_Namenodes)
* [Upgrading from an older release and configuring federation](#Upgrading_from_an_older_release_and_configuring_federation)
* [Adding a new Namenode to an existing HDFS cluster](#Adding_a_new_Namenode_to_an_existing_HDFS_cluster)
* [Managing the cluster](#Managing_the_cluster)
* [Starting and stopping cluster](#Starting_and_stopping_cluster)
* [Balancer](#Balancer)
* [Decommissioning](#Decommissioning)
* [Cluster Web Console](#Cluster_Web_Console)
This guide provides an overview of the HDFS Federation feature and how to configure and manage the federated cluster.
Background
----------
![HDFS Layers](./images/federation-background.gif)
HDFS has two main layers:
* **Namespace**
* Consists of directories, files and blocks.
* It supports all the namespace related file system operations such as
create, delete, modify and list files and directories.
* **Block Storage Service**, which has two parts:
* Block Management (performed in the Namenode)
* Provides Datanode cluster membership by handling registrations, and periodic heart beats.
* Processes block reports and maintains location of blocks.
* Supports block related operations such as create, delete, modify and
get block location.
* Manages replica placement, block replication for under
replicated blocks, and deletes blocks that are over replicated.
* Storage - is provided by Datanodes by storing blocks on the local file
system and allowing read/write access.
The prior HDFS architecture allows only a single namespace for the entire cluster. In that configuration, a single Namenode manages the namespace. HDFS Federation addresses this limitation by adding support for multiple Namenodes/namespaces to HDFS.
Multiple Namenodes/Namespaces
-----------------------------
In order to scale the name service horizontally, federation uses multiple independent Namenodes/namespaces. The Namenodes are federated; the Namenodes are independent and do not require coordination with each other. The Datanodes are used as common storage for blocks by all the Namenodes. Each Datanode registers with all the Namenodes in the cluster. Datanodes send periodic heartbeats and block reports. They also handle commands from the Namenodes.
Users may use [ViewFs](./ViewFs.html) to create personalized namespace views. ViewFs is analogous to client side mount tables in some Unix/Linux systems.
![HDFS Federation Architecture](./images/federation.gif)
**Block Pool**
A Block Pool is a set of blocks that belong to a single namespace. Datanodes store blocks for all the block pools in the cluster. Each Block Pool is managed independently. This allows a namespace to generate Block IDs for new blocks without the need for coordination with the other namespaces. A Namenode failure does not prevent the Datanode from serving other Namenodes in the cluster.
A Namespace and its block pool together are called Namespace Volume. It is a self-contained unit of management. When a Namenode/namespace is deleted, the corresponding block pool at the Datanodes is deleted. Each namespace volume is upgraded as a unit, during cluster upgrade.
**ClusterID**
A **ClusterID** identifier is used to identify all the nodes in the cluster. When a Namenode is formatted, this identifier is either provided or auto generated. This ID should be used for formatting the other Namenodes into the cluster.
### Key Benefits
* Namespace Scalability - Federation adds namespace horizontal
scaling. Large deployments or deployments using a lot of small files
benefit from namespace scaling by allowing more Namenodes to be
added to the cluster.
* Performance - File system throughput is not limited by a single
Namenode. Adding more Namenodes to the cluster scales the file
system read/write throughput.
* Isolation - A single Namenode offers no isolation in a multi user
environment. For example, an experimental application can overload
the Namenode and slow down production critical applications. By using
multiple Namenodes, different categories of applications and users
can be isolated to different namespaces.
Federation Configuration
------------------------
Federation configuration is **backward compatible** and allows existing single Namenode configurations to work without any change. The new configuration is designed such that all the nodes in the cluster have the same configuration without the need for deploying different configurations based on the type of the node in the cluster.
Federation adds a new `NameServiceID` abstraction. A Namenode and its corresponding secondary/backup/checkpointer nodes all belong to a NameServiceId. In order to support a single configuration file, the Namenode and secondary/backup/checkpointer configuration parameters are suffixed with the `NameServiceID`.
### Configuration:
**Step 1**: Add the `dfs.nameservices` parameter to your configuration and configure it with a list of comma separated NameServiceIDs. This will be used by the Datanodes to determine the Namenodes in the cluster.
**Step 2**: For each Namenode and Secondary Namenode/BackupNode/Checkpointer add the following configuration parameters suffixed with the corresponding `NameServiceID` into the common configuration file:
| Daemon | Configuration Parameter |
|:---- |:---- |
| Namenode | `dfs.namenode.rpc-address` `dfs.namenode.servicerpc-address` `dfs.namenode.http-address` `dfs.namenode.https-address` `dfs.namenode.keytab.file` `dfs.namenode.name.dir` `dfs.namenode.edits.dir` `dfs.namenode.checkpoint.dir` `dfs.namenode.checkpoint.edits.dir` |
| Secondary Namenode | `dfs.namenode.secondary.http-address` `dfs.secondary.namenode.keytab.file` |
| BackupNode | `dfs.namenode.backup.address` `dfs.secondary.namenode.keytab.file` |
Here is an example configuration with two Namenodes:
```xml
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>nn-host1:rpc-port</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns1</name>
    <value>nn-host1:http-port</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address.ns1</name>
    <value>snn-host1:http-port</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>nn-host2:rpc-port</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns2</name>
    <value>nn-host2:http-port</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address.ns2</name>
    <value>snn-host2:http-port</value>
  </property>

  .... Other common configuration ...
</configuration>
```
### Formatting Namenodes
**Step 1**: Format a Namenode using the following command:
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -format [-clusterId <cluster_id>]
Choose a unique cluster\_id which will not conflict with other clusters in your environment. If a cluster\_id is not provided, then a unique one is auto generated.
**Step 2**: Format additional Namenodes using the following command:
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -format -clusterId <cluster_id>
Note that the cluster\_id in step 2 must be same as that of the cluster\_id in step 1. If they are different, the additional Namenodes will not be part of the federated cluster.
### Upgrading from an older release and configuring federation
Older releases only support a single Namenode. Upgrade the cluster to a newer release in order to enable federation. During the upgrade you can provide a ClusterID as follows:
[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon start namenode -upgrade -clusterId <cluster_ID>
If cluster\_id is not provided, it is auto generated.
### Adding a new Namenode to an existing HDFS cluster
Perform the following steps:
* Add `dfs.nameservices` to the configuration.
* Update the configuration with the NameServiceID suffix. Configuration
key names changed post release 0.20. You must use the new configuration
parameter names in order to use federation.
* Add the new Namenode related config to the configuration file.
* Propagate the configuration file to all the nodes in the cluster.
* Start the new Namenode and Secondary/Backup.
* Refresh the Datanodes to pick up the newly added Namenode by running
  the following command against all the Datanodes in the cluster:
[hdfs]$ $HADOOP_PREFIX/bin/hdfs dfsadmin -refreshNameNode <datanode_host_name>:<datanode_rpc_port>
Managing the cluster
--------------------
### Starting and stopping cluster
To start the cluster run the following command:
[hdfs]$ $HADOOP_PREFIX/sbin/start-dfs.sh
To stop the cluster run the following command:
[hdfs]$ $HADOOP_PREFIX/sbin/stop-dfs.sh
These commands can be run from any node where the HDFS configuration is available. The command uses the configuration to determine the Namenodes in the cluster and then starts the Namenode process on those nodes. The Datanodes are started on the nodes specified in the `slaves` file. The script can be used as a reference for building your own scripts to start and stop the cluster.
### Balancer
The Balancer has been changed to work with multiple Namenodes. The Balancer can be run using the command:
[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon start balancer [-policy <policy>]
The policy parameter can be any of the following:
* `datanode` - this is the *default* policy. This balances the storage at
the Datanode level. This is similar to balancing policy from prior releases.
* `blockpool` - this balances the storage at the block pool
level which also balances at the Datanode level.
Note that Balancer only balances the data and does not balance the namespace.
For the complete command usage, see [balancer](../hadoop-common/CommandsManual.html#balancer).
### Decommissioning
Decommissioning is similar to prior releases. The nodes that need to be decommissioned are added to the exclude file at all of the Namenodes. Each Namenode decommissions its Block Pool. When all the Namenodes finish decommissioning a Datanode, the Datanode is considered decommissioned.
**Step 1**: To distribute an exclude file to all the Namenodes, use the following command:
[hdfs]$ $HADOOP_PREFIX/sbin/distribute-exclude.sh <exclude_file>
**Step 2**: Refresh all the Namenodes to pick up the new exclude file:
[hdfs]$ $HADOOP_PREFIX/sbin/refresh-namenodes.sh
The above command uses HDFS configuration to determine the configured Namenodes in the cluster and refreshes them to pick up the new exclude file.
### Cluster Web Console
Similar to the Namenode status web page, when using federation a Cluster Web Console is available to monitor the federated cluster at `http://<any_nn_host:port>/dfsclusterhealth.jsp`. Any Namenode in the cluster can be used to access this web page.
The Cluster Web Console provides the following information:
* A cluster summary that shows the number of files, number of blocks,
total configured storage capacity, and the available and used storage
for the entire cluster.
* A list of Namenodes and a summary that includes the number of files,
blocks, missing blocks, and live and dead data nodes for each
Namenode. It also provides a link to access each Namenode's web UI.
* The decommissioning status of Datanodes.


@ -0,0 +1,505 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
HDFS Commands Guide
===================
* [Overview](#Overview)
* [User Commands](#User_Commands)
* [classpath](#classpath)
* [dfs](#dfs)
* [fetchdt](#fetchdt)
* [fsck](#fsck)
* [getconf](#getconf)
* [groups](#groups)
* [lsSnapshottableDir](#lsSnapshottableDir)
* [jmxget](#jmxget)
* [oev](#oev)
* [oiv](#oiv)
* [oiv\_legacy](#oiv_legacy)
* [snapshotDiff](#snapshotDiff)
* [version](#version)
* [Administration Commands](#Administration_Commands)
* [balancer](#balancer)
* [cacheadmin](#cacheadmin)
* [crypto](#crypto)
* [datanode](#datanode)
* [dfsadmin](#dfsadmin)
* [haadmin](#haadmin)
* [journalnode](#journalnode)
* [mover](#mover)
* [namenode](#namenode)
* [nfs3](#nfs3)
* [portmap](#portmap)
* [secondarynamenode](#secondarynamenode)
* [storagepolicies](#storagepolicies)
* [zkfc](#zkfc)
* [Debug Commands](#Debug_Commands)
* [verify](#verify)
* [recoverLease](#recoverLease)
Overview
--------
All HDFS commands are invoked by the `bin/hdfs` script. Running the hdfs script without any arguments prints the description for all commands.
Usage: `hdfs [SHELL_OPTIONS] COMMAND [GENERIC_OPTIONS] [COMMAND_OPTIONS]`
Hadoop has an option parsing framework that employs parsing generic options as well as running classes.
| COMMAND\_OPTIONS | Description |
|:---- |:---- |
| SHELL\_OPTIONS | The common set of shell options. These are documented on the [Commands Manual](../../hadoop-project-dist/hadoop-common/CommandsManual.html#Shell_Options) page. |
| GENERIC\_OPTIONS | The common set of options supported by multiple commands. See the Hadoop [Commands Manual](../../hadoop-project-dist/hadoop-common/CommandsManual.html#Generic_Options) for more information. |
| COMMAND COMMAND\_OPTIONS | Various commands with their options are described in the following sections. The commands have been grouped into [User Commands](#User_Commands) and [Administration Commands](#Administration_Commands). |
User Commands
-------------
Commands useful for users of a hadoop cluster.
### `classpath`
Usage: `hdfs classpath`
Prints the class path needed to get the Hadoop jar and the required libraries
### `dfs`
Usage: `hdfs dfs [COMMAND [COMMAND_OPTIONS]]`
Run a filesystem command on the file system supported in Hadoop. The various COMMAND\_OPTIONS can be found at [File System Shell Guide](../hadoop-common/FileSystemShell.html).
### `fetchdt`
Usage: `hdfs fetchdt [--webservice <namenode_http_addr>] <path> `
| COMMAND\_OPTION | Description |
|:---- |:---- |
| `--webservice` *https\_address* | use http protocol instead of RPC |
| *fileName* | File name to store the token into. |
Gets Delegation Token from a NameNode. See [fetchdt](./HdfsUserGuide.html#fetchdt) for more info.
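A minimal sketch (the NameNode address and the output file name are hypothetical): fetch a delegation token over the NameNode's HTTP interface and store it in a local file:

    # fetch a delegation token via the NameNode web interface
    hdfs fetchdt --webservice http://nn-host1:50070 /tmp/my.delegation.token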
### `fsck`
Usage:
hdfs fsck <path>
[-list-corruptfileblocks |
[-move | -delete | -openforwrite]
[-files [-blocks [-locations | -racks]]]
[-includeSnapshots] [-showprogress]
| COMMAND\_OPTION | Description |
|:---- |:---- |
| *path* | Start checking from this path. |
| `-delete` | Delete corrupted files. |
| `-files` | Print out files being checked. |
| `-files` `-blocks` | Print out the block report |
| `-files` `-blocks` `-locations` | Print out locations for every block. |
| `-files` `-blocks` `-racks` | Print out network topology for data-node locations. |
| `-includeSnapshots` | Include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it. |
| `-list-corruptfileblocks` | Print out list of missing blocks and files they belong to. |
| `-move` | Move corrupted files to /lost+found. |
| `-openforwrite` | Print out files opened for write. |
| `-showprogress` | Print out dots for progress in output. Default is OFF (no progress). |
Runs the HDFS filesystem checking utility. See [fsck](./HdfsUserGuide.html#fsck) for more info.
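For example, the following checks everything under the root directory and prints the files being checked together with their blocks and block locations (the path `/` is only an illustration):

    # check the whole namespace and show block locations
    hdfs fsck / -files -blocks -locations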
### `getconf`
Usage:
hdfs getconf -namenodes
hdfs getconf -secondaryNameNodes
hdfs getconf -backupNodes
hdfs getconf -includeFile
hdfs getconf -excludeFile
hdfs getconf -nnRpcAddresses
hdfs getconf -confKey [key]
| COMMAND\_OPTION | Description |
|:---- |:---- |
| `-namenodes` | gets list of namenodes in the cluster. |
| `-secondaryNameNodes` | gets list of secondary namenodes in the cluster. |
| `-backupNodes` | gets list of backup nodes in the cluster. |
| `-includeFile` | gets the include file path that defines the datanodes that can join the cluster. |
| `-excludeFile` | gets the exclude file path that defines the datanodes that need to be decommissioned. |
| `-nnRpcAddresses` | gets the namenode rpc addresses |
| `-confKey` [key] | gets a specific key from the configuration |
Gets configuration information from the configuration directory, post-processing.
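For example (the configuration key shown is only an illustration):

    # list the namenodes known to the configuration
    hdfs getconf -namenodes
    # print the value of a single configuration key
    hdfs getconf -confKey dfs.namenode.name.dir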
### `groups`
Usage: `hdfs groups [username ...]`
Returns the group information given one or more usernames.
### `lsSnapshottableDir`
Usage: `hdfs lsSnapshottableDir [-help]`
| COMMAND\_OPTION | Description |
|:---- |:---- |
| `-help` | print help |
Get the list of snapshottable directories. When this is run as a super user, it returns all snapshottable directories. Otherwise it returns those directories that are owned by the current user.
### `jmxget`
Usage: `hdfs jmxget [-localVM ConnectorURL | -port port | -server mbeanserver | -service service]`
| COMMAND\_OPTION | Description |
|:---- |:---- |
| `-help` | print help |
| `-localVM` ConnectorURL | connect to the VM on the same machine |
| `-port` *mbean server port* | specify mbean server port, if missing it will try to connect to MBean Server in the same VM |
| `-service` | specify the jmx service, either DataNode or NameNode; NameNode is the default |
Dump JMX information from a service.
### `oev`
Usage: `hdfs oev [OPTIONS] -i INPUT_FILE -o OUTPUT_FILE`
#### Required command line arguments:
| COMMAND\_OPTION | Description |
|:---- |:---- |
| `-i`,`--inputFile` *arg* | edits file to process, xml (case insensitive) extension means XML format, any other filename means binary format |
| `-o`,`--outputFile` *arg* | Name of output file. If the specified file exists, it will be overwritten, format of the file is determined by -p option |
#### Optional command line arguments:
| COMMAND\_OPTION | Description |
|:---- |:---- |
| `-f`,`--fix-txids` | Renumber the transaction IDs in the input, so that there are no gaps or invalid transaction IDs. |
| `-h`,`--help` | Display usage information and exit |
| `-r`,`--recover` | When reading binary edit logs, use recovery mode. This will give you the chance to skip corrupt parts of the edit log. |
| `-p`,`--processor` *arg* | Select which type of processor to apply against image file, currently supported processors are: binary (native binary format that Hadoop uses), xml (default, XML format), stats (prints statistics about edits file) |
| `-v`,`--verbose` | More verbose output, prints the input and output filenames, for processors that write to a file, also output to screen. On large image files this will dramatically increase processing time (default is false). |
Hadoop offline edits viewer.
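A minimal sketch, assuming a binary edits file named `edits` in the current directory (the file names are hypothetical):

    # convert a binary edits file to the default XML output format
    hdfs oev -i edits -o edits.xml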
### `oiv`
Usage: `hdfs oiv [OPTIONS] -i INPUT_FILE`
#### Required command line arguments:
| COMMAND\_OPTION | Description |
|:---- |:---- |
| `-i`,`--inputFile` *arg* | fsimage file to process |
#### Optional command line arguments:
| COMMAND\_OPTION | Description |
|:---- |:---- |
| `-h`,`--help` | Display usage information and exit |
| `-o`,`--outputFile` *arg* | Name of output file. If the specified file exists, it will be overwritten, format of the file is determined by -p option |
| `-p`,`--processor` *arg* | Select which type of processor to apply against the image file, currently supported processors are: binary (native binary format that Hadoop uses), xml (default, XML format), stats (prints statistics about the image file) |
Hadoop Offline Image Viewer for newer image files.
### `oiv_legacy`
Usage: `hdfs oiv_legacy [OPTIONS] -i INPUT_FILE -o OUTPUT_FILE`
| COMMAND\_OPTION | Description |
|:---- |:---- |
| `-h`,`--help` | Display usage information and exit |
| `-i`,`--inputFile` *arg* | fsimage file to process |
| `-o`,`--outputFile` *arg* | Name of output file. If the specified file exists, it will be overwritten, format of the file is determined by -p option |
Hadoop offline image viewer for older versions of Hadoop.
### `snapshotDiff`
Usage: `hdfs snapshotDiff <path> <fromSnapshot> <toSnapshot> `
Determine the difference between HDFS snapshots. See the [HDFS Snapshot Documentation](./HdfsSnapshots.html#Get_Snapshots_Difference_Report) for more information.
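For example, assuming `/user/alice` is a snapshottable directory with snapshots `s1` and `s2` (all names are hypothetical):

    # report the changes between snapshot s1 and snapshot s2
    hdfs snapshotDiff /user/alice s1 s2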
### `version`
Usage: `hdfs version`
Prints the version.
Administration Commands
-----------------------
Commands useful for administrators of a hadoop cluster.
### `balancer`
Usage: `hdfs balancer [-threshold <threshold>] [-policy <policy>] [-idleiterations <idleiterations>]`
| COMMAND\_OPTION | Description |
|:---- |:---- |
| `-policy` *policy* | `datanode` (default): Cluster is balanced if each datanode is balanced.<br/> `blockpool`: Cluster is balanced if each block pool in each datanode is balanced. |
| `-threshold` *threshold* | Percentage of disk capacity. This overwrites the default threshold. |
| `-idleiterations` *iterations* | Maximum number of idle iterations before exit. This overwrites the default idleiterations(5). |
Runs a cluster balancing utility. An administrator can simply press Ctrl-C to stop the rebalancing process. See [Balancer](./HdfsUserGuide.html#Balancer) for more details.
Note that the `blockpool` policy is more strict than the `datanode` policy.
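For example, the following runs the balancer at the block pool level and treats the cluster as balanced when every block pool is within 5% of disk capacity (the threshold value is only an illustration):

    hdfs balancer -policy blockpool -threshold 5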
### `cacheadmin`
Usage: `hdfs cacheadmin -addDirective -path <path> -pool <pool-name> [-force] [-replication <replication>] [-ttl <time-to-live>]`
See the [HDFS Cache Administration Documentation](./CentralizedCacheManagement.html#cacheadmin_command-line_interface) for more information.
### `crypto`
Usage:
hdfs crypto -createZone -keyName <keyName> -path <path>
hdfs crypto -help <command-name>
hdfs crypto -listZones
See the [HDFS Transparent Encryption Documentation](./TransparentEncryption.html#crypto_command-line_interface) for more information.
### `datanode`
Usage: `hdfs datanode [-regular | -rollback | -rollingupgrade rollback]`
| COMMAND\_OPTION | Description |
|:---- |:---- |
| `-regular` | Normal datanode startup (default). |
| `-rollback` | Rollback the datanode to the previous version. This should be used after stopping the datanode and distributing the old hadoop version. |
| `-rollingupgrade` rollback | Rollback a rolling upgrade operation. |
Runs a HDFS datanode.
### `dfsadmin`
Usage:
hdfs dfsadmin [GENERIC_OPTIONS]
[-report [-live] [-dead] [-decommissioning]]
[-safemode enter | leave | get | wait]
[-saveNamespace]
[-rollEdits]
[-restoreFailedStorage true |false |check]
[-refreshNodes]
[-setQuota <quota> <dirname>...<dirname>]
[-clrQuota <dirname>...<dirname>]
[-setSpaceQuota <quota> <dirname>...<dirname>]
[-clrSpaceQuota <dirname>...<dirname>]
[-setStoragePolicy <path> <policyName>]
[-getStoragePolicy <path>]
[-finalizeUpgrade]
[-rollingUpgrade [<query> |<prepare> |<finalize>]]
[-metasave filename]
[-refreshServiceAcl]
[-refreshUserToGroupsMappings]
[-refreshSuperUserGroupsConfiguration]
[-refreshCallQueue]
[-refresh <host:ipc_port> <key> [arg1..argn]]
[-reconfig <datanode |...> <host:ipc_port> <start |status>]
[-printTopology]
[-refreshNamenodes datanodehost:port]
[-deleteBlockPool datanode-host:port blockpoolId [force]]
[-setBalancerBandwidth <bandwidth in bytes per second>]
[-allowSnapshot <snapshotDir>]
[-disallowSnapshot <snapshotDir>]
[-fetchImage <local directory>]
[-shutdownDatanode <datanode_host:ipc_port> [upgrade]]
[-getDatanodeInfo <datanode_host:ipc_port>]
[-triggerBlockReport [-incremental] <datanode_host:ipc_port>]
[-help [cmd]]
| COMMAND\_OPTION | Description |
|:---- |:---- |
| `-report` `[-live]` `[-dead]` `[-decommissioning]` | Reports basic filesystem information and statistics. Optional flags may be used to filter the list of displayed DataNodes. |
| `-safemode` enter\|leave\|get\|wait | Safe mode maintenance command. Safe mode is a Namenode state in which it <br/>1. does not accept changes to the name space (read-only) <br/>2. does not replicate or delete blocks. <br/>Safe mode is entered automatically at Namenode startup, and leaves safe mode automatically when the configured minimum percentage of blocks satisfies the minimum replication condition. Safe mode can also be entered manually, but then it can only be turned off manually as well. |
| `-saveNamespace` | Save current namespace into storage directories and reset edits log. Requires safe mode. |
| `-rollEdits` | Rolls the edit log on the active NameNode. |
| `-restoreFailedStorage` true\|false\|check | This option will turn on/off automatic attempt to restore failed storage replicas. If a failed storage becomes available again the system will attempt to restore edits and/or fsimage during checkpoint. 'check' option will return current setting. |
| `-refreshNodes` | Re-read the hosts and exclude files to update the set of Datanodes that are allowed to connect to the Namenode and those that should be decommissioned or recommissioned. |
| `-setQuota` \<quota\> \<dirname\>...\<dirname\> | See [HDFS Quotas Guide](../hadoop-hdfs/HdfsQuotaAdminGuide.html#Administrative_Commands) for the detail. |
| `-clrQuota` \<dirname\>...\<dirname\> | See [HDFS Quotas Guide](../hadoop-hdfs/HdfsQuotaAdminGuide.html#Administrative_Commands) for the detail. |
| `-setSpaceQuota` \<quota\> \<dirname\>...\<dirname\> | See [HDFS Quotas Guide](../hadoop-hdfs/HdfsQuotaAdminGuide.html#Administrative_Commands) for the detail. |
| `-clrSpaceQuota` \<dirname\>...\<dirname\> | See [HDFS Quotas Guide](../hadoop-hdfs/HdfsQuotaAdminGuide.html#Administrative_Commands) for the detail. |
| `-setStoragePolicy` \<path\> \<policyName\> | Set a storage policy to a file or a directory. |
| `-getStoragePolicy` \<path\> | Get the storage policy of a file or a directory. |
| `-finalizeUpgrade` | Finalize upgrade of HDFS. Datanodes delete their previous version working directories, followed by Namenode doing the same. This completes the upgrade process. |
| `-rollingUpgrade` [\<query\>\|\<prepare\>\|\<finalize\>] | See [Rolling Upgrade document](../hadoop-hdfs/HdfsRollingUpgrade.html#dfsadmin_-rollingUpgrade) for the detail. |
| `-metasave` filename | Save Namenode's primary data structures to *filename* in the directory specified by hadoop.log.dir property. *filename* is overwritten if it exists. *filename* will contain one line for each of the following<br/>1. Datanodes heart beating with Namenode<br/>2. Blocks waiting to be replicated<br/>3. Blocks currently being replicated<br/>4. Blocks waiting to be deleted |
| `-refreshServiceAcl` | Reload the service-level authorization policy file. |
| `-refreshUserToGroupsMappings` | Refresh user-to-groups mappings. |
| `-refreshSuperUserGroupsConfiguration` | Refresh superuser proxy groups mappings |
| `-refreshCallQueue` | Reload the call queue from config. |
| `-refresh` \<host:ipc\_port\> \<key\> [arg1..argn] | Triggers a runtime-refresh of the resource specified by \<key\> on \<host:ipc\_port\>. All other args after are sent to the host. |
| `-reconfig` \<datanode \|...\> \<host:ipc\_port\> \<start\|status\> | Start reconfiguration or get the status of an ongoing reconfiguration. The second parameter specifies the node type. Currently, only reloading DataNode's configuration is supported. |
| `-printTopology` | Print a tree of the racks and their nodes as reported by the Namenode |
| `-refreshNamenodes` datanodehost:port | For the given datanode, reloads the configuration files, stops serving the removed block-pools and starts serving new block-pools. |
| `-deleteBlockPool` datanode-host:port blockpoolId [force] | If force is passed, block pool directory for the given blockpool id on the given datanode is deleted along with its contents, otherwise the directory is deleted only if it is empty. The command will fail if datanode is still serving the block pool. Refer to refreshNamenodes to shutdown a block pool service on a datanode. |
| `-setBalancerBandwidth` \<bandwidth in bytes per second\> | Changes the network bandwidth used by each datanode during HDFS block balancing. \<bandwidth\> is the maximum number of bytes per second that will be used by each datanode. This value overrides the dfs.balance.bandwidthPerSec parameter. NOTE: The new value is not persistent on the DataNode. |
| `-allowSnapshot` \<snapshotDir\> | Allowing snapshots of a directory to be created. If the operation completes successfully, the directory becomes snapshottable. See the [HDFS Snapshot Documentation](./HdfsSnapshots.html) for more information. |
| `-disallowSnapshot` \<snapshotDir\> | Disallowing snapshots of a directory to be created. All snapshots of the directory must be deleted before disallowing snapshots. See the [HDFS Snapshot Documentation](./HdfsSnapshots.html) for more information. |
| `-fetchImage` \<local directory\> | Downloads the most recent fsimage from the NameNode and saves it in the specified local directory. |
| `-shutdownDatanode` \<datanode\_host:ipc\_port\> [upgrade] | Submit a shutdown request for the given datanode. See [Rolling Upgrade document](./HdfsRollingUpgrade.html#dfsadmin_-shutdownDatanode) for the detail. |
| `-getDatanodeInfo` \<datanode\_host:ipc\_port\> | Get the information about the given datanode. See [Rolling Upgrade document](./HdfsRollingUpgrade.html#dfsadmin_-getDatanodeInfo) for the detail. |
| `-triggerBlockReport` `[-incremental]` \<datanode\_host:ipc\_port\> | Trigger a block report for the given datanode. If 'incremental' is specified, it will be an incremental block report; otherwise, it will be a full block report. |
| `-help` [cmd] | Displays help for the given command or all commands if none is specified. |
Runs a HDFS dfsadmin client.
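For example, a small sample of the options above (shown only as an illustration):

    # report basic statistics for live datanodes only
    hdfs dfsadmin -report -live
    # query the current safe mode state
    hdfs dfsadmin -safemode get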
### `haadmin`
Usage:
hdfs haadmin -checkHealth <serviceId>
hdfs haadmin -failover [--forcefence] [--forceactive] <serviceId> <serviceId>
hdfs haadmin -getServiceState <serviceId>
hdfs haadmin -help <command>
hdfs haadmin -transitionToActive <serviceId> [--forceactive]
hdfs haadmin -transitionToStandby <serviceId>
| COMMAND\_OPTION | Description |
|:---- |:---- |
| `-checkHealth` | check the health of the given NameNode |
| `-failover` | initiate a failover between two NameNodes |
| `-getServiceState` | determine whether the given NameNode is Active or Standby |
| `-transitionToActive` | transition the state of the given NameNode to Active (Warning: No fencing is done) |
| `-transitionToStandby` | transition the state of the given NameNode to Standby (Warning: No fencing is done) |
See [HDFS HA with NFS](./HDFSHighAvailabilityWithNFS.html#Administrative_commands) or [HDFS HA with QJM](./HDFSHighAvailabilityWithQJM.html#Administrative_commands) for more information on this command.
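For example, assuming NameNode IDs `nn1` and `nn2` (hypothetical IDs; see the HA guides for how NameNode IDs are configured):

    # check which state nn1 is currently in
    hdfs haadmin -getServiceState nn1
    # initiate a failover so that nn2 becomes the Active NameNode
    hdfs haadmin -failover nn1 nn2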
### `journalnode`
Usage: `hdfs journalnode`
This command starts a journalnode for use with [HDFS HA with QJM](./HDFSHighAvailabilityWithQJM.html#Administrative_commands).
### `mover`
Usage: `hdfs mover [-p <files/dirs> | -f <local file name>]`
| COMMAND\_OPTION | Description |
|:---- |:---- |
| `-f` \<local file\> | Specify a local file containing a list of HDFS files/dirs to migrate. |
| `-p` \<files/dirs\> | Specify a space separated list of HDFS files/dirs to migrate. |
Runs the data migration utility. See [Mover](./ArchivalStorage.html#Mover_-_A_New_Data_Migration_Tool) for more details.
Note that, when both -p and -f options are omitted, the default path is the root directory.
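For example, to migrate the blocks of a specific directory so that they match its storage policy (the path is hypothetical):

    hdfs mover -p /archive/cold-data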
### `namenode`
Usage:
hdfs namenode [-backup] |
[-checkpoint] |
[-format [-clusterid cid ] [-force] [-nonInteractive] ] |
[-upgrade [-clusterid cid] [-renameReserved<k-v pairs>] ] |
[-upgradeOnly [-clusterid cid] [-renameReserved<k-v pairs>] ] |
[-rollback] |
[-rollingUpgrade <downgrade |rollback> ] |
[-finalize] |
[-importCheckpoint] |
[-initializeSharedEdits] |
[-bootstrapStandby] |
[-recover [-force] ] |
[-metadataVersion ]
| COMMAND\_OPTION | Description |
|:---- |:---- |
| `-backup` | Start backup node. |
| `-checkpoint` | Start checkpoint node. |
| `-format` `[-clusterid cid]` `[-force]` `[-nonInteractive]` | Formats the specified NameNode. It starts the NameNode, formats it and then shuts it down. The -force option formats even if the name directory exists. The -nonInteractive option aborts if the name directory exists, unless the -force option is specified. |
| `-upgrade` `[-clusterid cid]` [`-renameReserved` \<k-v pairs\>] | Namenode should be started with upgrade option after the distribution of new Hadoop version. |
| `-upgradeOnly` `[-clusterid cid]` [`-renameReserved` \<k-v pairs\>] | Upgrade the specified NameNode and then shut it down. |
| `-rollback` | Rollback the NameNode to the previous version. This should be used after stopping the cluster and distributing the old Hadoop version. |
| `-rollingUpgrade` \<downgrade\|rollback\|started\> | See [Rolling Upgrade document](./HdfsRollingUpgrade.html#NameNode_Startup_Options) for the detail. |
| `-finalize` | Finalize will remove the previous state of the file system. The most recent upgrade will become permanent and the rollback option will no longer be available. After finalization it shuts the NameNode down. |
| `-importCheckpoint` | Loads image from a checkpoint directory and saves it into the current one. The checkpoint dir is read from the property fs.checkpoint.dir |
| `-initializeSharedEdits` | Format a new shared edits dir and copy in enough edit log segments so that the standby NameNode can start up. |
| `-bootstrapStandby` | Allows the standby NameNode's storage directories to be bootstrapped by copying the latest namespace snapshot from the active NameNode. This is used when first configuring an HA cluster. |
| `-recover` `[-force]` | Recover lost metadata on a corrupt filesystem. See [HDFS User Guide](./HdfsUserGuide.html#Recovery_Mode) for the detail. |
| `-metadataVersion` | Verify that configured directories exist, then print the metadata versions of the software and the image. |
Runs the namenode. More info about the upgrade, rollback and finalize is at [Upgrade Rollback](./HdfsUserGuide.html#Upgrade_and_Rollback).
### `nfs3`
Usage: `hdfs nfs3`
This command starts the NFS3 gateway for use with the [HDFS NFS3 Service](./HdfsNfsGateway.html#Start_and_stop_NFS_gateway_service).
### `portmap`
Usage: `hdfs portmap`
This command starts the RPC portmap for use with the [HDFS NFS3 Service](./HdfsNfsGateway.html#Start_and_stop_NFS_gateway_service).
### `secondarynamenode`
Usage: `hdfs secondarynamenode [-checkpoint [force]] | [-format] | [-geteditsize]`
| COMMAND\_OPTION | Description |
|:---- |:---- |
| `-checkpoint` [force] | Checkpoints the SecondaryNameNode if EditLog size \>= fs.checkpoint.size. If `force` is used, checkpoint irrespective of EditLog size. |
| `-format` | Format the local storage during startup. |
| `-geteditsize` | Prints the number of uncheckpointed transactions on the NameNode. |
Runs the HDFS secondary namenode. See [Secondary Namenode](./HdfsUserGuide.html#Secondary_NameNode) for more info.
### `storagepolicies`
Usage: `hdfs storagepolicies`
Lists out all storage policies. See the [HDFS Storage Policy Documentation](./ArchivalStorage.html) for more information.
### `zkfc`
Usage: `hdfs zkfc [-formatZK [-force] [-nonInteractive]]`
| COMMAND\_OPTION | Description |
|:---- |:---- |
| `-formatZK` | Format the Zookeeper instance |
| `-h` | Display help |
This command starts a Zookeeper Failover Controller process for use with [HDFS HA with QJM](./HDFSHighAvailabilityWithQJM.html#Administrative_commands).
Debug Commands
--------------
Useful commands to help administrators debug HDFS issues, like validating block files and calling recoverLease.
### `verify`
Usage: `hdfs debug verify -meta <metadata-file> [-block <block-file>]`
| COMMAND\_OPTION | Description |
|:---- |:---- |
| `-block` *block-file* | Optional parameter to specify the absolute path for the block file on the local file system of the data node. |
| `-meta` *metadata-file* | Absolute path for the metadata file on the local file system of the data node. |
Verify HDFS metadata and block files. If a block file is specified, we will verify that the checksums in the metadata file match the block file.
### `recoverLease`
Usage: `hdfs debug recoverLease -path <path> [-retries <num-retries>]`
| COMMAND\_OPTION | Description |
|:---- |:---- |
| [`-path` *path*] | HDFS path for which to recover the lease. |
| [`-retries` *num-retries*] | Number of times the client will retry calling recoverLease. The default number of retries is 1. |
Recover the lease on the specified path. The path must reside on an HDFS filesystem. The default number of retries is 1.
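For example (the path and retry count are hypothetical):

    # try up to three times to recover the lease on the given file
    hdfs debug recoverLease -path /user/alice/partially-written-file -retries 3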


@ -0,0 +1,678 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
HDFS High Availability
======================
* [HDFS High Availability](#HDFS_High_Availability)
* [Purpose](#Purpose)
* [Note: Using the Quorum Journal Manager or Conventional Shared Storage](#Note:_Using_the_Quorum_Journal_Manager_or_Conventional_Shared_Storage)
* [Background](#Background)
* [Architecture](#Architecture)
* [Hardware resources](#Hardware_resources)
* [Deployment](#Deployment)
* [Configuration overview](#Configuration_overview)
* [Configuration details](#Configuration_details)
* [Deployment details](#Deployment_details)
* [Administrative commands](#Administrative_commands)
* [Automatic Failover](#Automatic_Failover)
* [Introduction](#Introduction)
* [Components](#Components)
* [Deploying ZooKeeper](#Deploying_ZooKeeper)
* [Before you begin](#Before_you_begin)
* [Configuring automatic failover](#Configuring_automatic_failover)
* [Initializing HA state in ZooKeeper](#Initializing_HA_state_in_ZooKeeper)
* [Starting the cluster with start-dfs.sh](#Starting_the_cluster_with_start-dfs.sh)
* [Starting the cluster manually](#Starting_the_cluster_manually)
* [Securing access to ZooKeeper](#Securing_access_to_ZooKeeper)
* [Verifying automatic failover](#Verifying_automatic_failover)
* [Automatic Failover FAQ](#Automatic_Failover_FAQ)
* [BookKeeper as a Shared storage (EXPERIMENTAL)](#BookKeeper_as_a_Shared_storage_EXPERIMENTAL)
Purpose
-------
This guide provides an overview of the HDFS High Availability (HA) feature and how to configure and manage an HA HDFS cluster, using NFS for the shared storage required by the NameNodes.
This document assumes that the reader has a general understanding of the components and node types in an HDFS cluster. Please refer to the HDFS Architecture guide for details.
Note: Using the Quorum Journal Manager or Conventional Shared Storage
---------------------------------------------------------------------
This guide discusses how to configure and use HDFS HA using a shared NFS directory to share edit logs between the Active and Standby NameNodes. For information on how to configure HDFS HA using the Quorum Journal Manager instead of NFS, please see [this alternative guide.](./HDFSHighAvailabilityWithQJM.html)
Background
----------
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.
This impacted the total availability of the HDFS cluster in two major ways:
* In the case of an unplanned event such as a machine crash, the cluster would
be unavailable until an operator restarted the NameNode.
* Planned maintenance events such as software or hardware upgrades on the
NameNode machine would result in windows of cluster downtime.
The HDFS High Availability feature addresses the above problems by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.
Architecture
------------
In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an *Active* state, and the other is in a *Standby* state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.
In order for the Standby node to keep its state synchronized with the Active node, the current implementation requires that the two nodes both have access to a directory on a shared storage device (e.g. an NFS mount from a NAS). This restriction will likely be relaxed in future versions.
When any namespace modification is performed by the Active node, it durably logs a record of the modification to an edit log file stored in the shared directory. The Standby node is constantly watching this directory for edits, and as it sees the edits, it applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the shared storage before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.
In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.
It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called "split-brain scenario," the administrator must configure at least one *fencing method* for the shared storage. During a failover, if it cannot be verified that the previous Active node has relinquished its Active state, the fencing process is responsible for cutting off the previous Active's access to the shared edits storage. This prevents it from making any further edits to the namespace, allowing the new Active to safely proceed with failover.
Hardware resources
------------------
In order to deploy an HA cluster, you should prepare the following:
* **NameNode machines** - the machines on which you run the Active and Standby NameNodes should have equivalent hardware to each other, and equivalent hardware to what would be used in a non-HA cluster.
* **Shared storage** - you will need to have a shared directory which both NameNode machines can have read/write access to. Typically this is a remote filer which supports NFS and is mounted on each of the NameNode machines. Currently only a single shared edits directory is supported. Thus, the availability of the system is limited by the availability of this shared edits directory, and therefore in order to remove all single points of failure there needs to be redundancy for the shared edits directory. Specifically, multiple network paths to the storage, and redundancy in the storage itself (disk, network, and power). Because of this, it is recommended that the shared storage server be a high-quality dedicated NAS appliance rather than a simple Linux server.
Note that, in an HA cluster, the Standby NameNode also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to be HA-enabled to reuse the hardware which they had previously dedicated to the Secondary NameNode.
Deployment
----------
### Configuration overview
Similar to Federation configuration, HA configuration is backward compatible and allows existing single NameNode configurations to work without change. The new configuration is designed such that all the nodes in the cluster may have the same configuration without the need for deploying different configuration files to different machines based on the type of the node.
Like HDFS Federation, HA clusters reuse the `nameservice ID` to identify a single HDFS instance that may in fact consist of multiple HA NameNodes. In addition, a new abstraction called `NameNode ID` is added with HA. Each distinct NameNode in the cluster has a different NameNode ID to distinguish it. To support a single configuration file for all of the NameNodes, the relevant configuration parameters are suffixed with the **nameservice ID** as well as the **NameNode ID**.
### Configuration details
To configure HA NameNodes, you must add several configuration options to your **hdfs-site.xml** configuration file.
The order in which you set these configurations is unimportant, but the values you choose for **dfs.nameservices** and **dfs.ha.namenodes.[nameservice ID]** will determine the keys of those that follow. Thus, you should decide on these values before setting the rest of the configuration options.
* **dfs.nameservices** - the logical name for this new nameservice
Choose a logical name for this nameservice, for example "mycluster", and use
this logical name for the value of this config option. The name you choose is
arbitrary. It will be used both for configuration and as the authority
component of absolute HDFS paths in the cluster.
**Note:** If you are also using HDFS Federation, this configuration setting should also include the list of other nameservices, HA or otherwise, as a comma-separated list.
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
* **dfs.ha.namenodes.[nameservice ID]** - unique identifiers for each NameNode in the nameservice
Configure with a list of comma-separated NameNode IDs. This will be used by
DataNodes to determine all the NameNodes in the cluster. For example, if you
used "mycluster" as the nameservice ID previously, and you wanted to use "nn1"
and "nn2" as the individual IDs of the NameNodes, you would configure this as
such:
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
**Note:** Currently, only a maximum of two NameNodes may be configured per
nameservice.
* **dfs.namenode.rpc-address.[nameservice ID].[name node ID]** - the fully-qualified RPC address for each NameNode to listen on
For both of the previously-configured NameNode IDs, set the full address and
IPC port of the NameNode process. Note that this results in two separate
configuration options. For example:
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>machine1.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>machine2.example.com:8020</value>
</property>
**Note:** You may similarly configure the "**servicerpc-address**" setting if
you so desire.
* **dfs.namenode.http-address.[nameservice ID].[name node ID]** - the fully-qualified HTTP address for each NameNode to listen on
Similarly to *rpc-address* above, set the addresses for both NameNodes' HTTP
servers to listen on. For example:
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>machine1.example.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>machine2.example.com:50070</value>
</property>
**Note:** If you have Hadoop's security features enabled, you should also set
the *https-address* similarly for each NameNode.
* **dfs.namenode.shared.edits.dir** - the location of the shared storage directory
This is where one configures the path to the remote shared edits directory
which the Standby NameNode uses to stay up-to-date with all the file system
changes the Active NameNode makes. **You should only configure one of these
directories.** This directory should be mounted r/w on both NameNode machines.
The value of this setting should be the absolute path to this directory on the
NameNode machines. For example:
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>file:///mnt/filer1/dfs/ha-name-dir-shared</value>
</property>
* **dfs.client.failover.proxy.provider.[nameservice ID]** - the Java class that HDFS clients use to contact the Active NameNode
Configure the name of the Java class which will be used by the DFS Client to
determine which NameNode is the current Active, and therefore which NameNode is
currently serving client requests. The only implementation which currently
ships with Hadoop is the **ConfiguredFailoverProxyProvider**, so use this
unless you are using a custom one. For example:
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
* **dfs.ha.fencing.methods** - a list of scripts or Java classes which will be used to fence the Active NameNode during a failover
It is critical for correctness of the system that only one NameNode be in the
Active state at any given time. Thus, during a failover, we first ensure that
the Active NameNode is either in the Standby state, or the process has
terminated, before transitioning the other NameNode to the Active state. In
order to do this, you must configure at least one **fencing method.** These are
configured as a carriage-return-separated list, which will be attempted in order
until one indicates that fencing has succeeded. There are two methods which
ship with Hadoop: *shell* and *sshfence*. For information on implementing
your own custom fencing method, see the *org.apache.hadoop.ha.NodeFencer* class.
- - -
**sshfence** - SSH to the Active NameNode and kill the process
The *sshfence* option SSHes to the target node and uses *fuser* to kill
the process listening on the service's TCP port. In order for this fencing option
to work, it must be able to SSH to the target node without providing a
passphrase. Thus, one must also configure the
**dfs.ha.fencing.ssh.private-key-files** option, which is a
comma-separated list of SSH private key files. For example:
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/exampleuser/.ssh/id_rsa</value>
</property>
Optionally, one may configure a non-standard username or port to perform the
SSH. One may also configure a timeout, in milliseconds, for the SSH, after
which this fencing method will be considered to have failed. It may be
configured like so:
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence([[username][:port]])</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>
- - -
**shell** - run an arbitrary shell command to fence the Active NameNode
The *shell* fencing method runs an arbitrary shell command. It may be
configured like so:
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
</property>
The string between '(' and ')' is passed directly to a bash shell and may not
include any closing parentheses.
The shell command will be run with an environment set up to contain all of the
current Hadoop configuration variables, with the '\_' character replacing any
'.' characters in the configuration keys. The configuration used has already had
any namenode-specific configurations promoted to their generic forms -- for example
**dfs\_namenode\_rpc-address** will contain the RPC address of the target node, even
though the configuration may specify that variable as
**dfs.namenode.rpc-address.ns1.nn1**.
Additionally, the following variables referring to the target node to be fenced
are also available:
| | |
|:---- |:---- |
| $target\_host | hostname of the node to be fenced |
| $target\_port | IPC port of the node to be fenced |
| $target\_address | the above two, combined as host:port |
| $target\_nameserviceid | the nameservice ID of the NN to be fenced |
| $target\_namenodeid | the namenode ID of the NN to be fenced |
These environment variables may also be used as substitutions in the shell
command itself. For example:
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/path/to/my/script.sh --nameservice=$target_nameserviceid $target_host:$target_port)</value>
</property>
If the shell command returns an exit
code of 0, the fencing is determined to be successful. If it returns any other
exit code, the fencing was not successful and the next fencing method in the
list will be attempted.
**Note:** This fencing method does not implement any timeout. If timeouts are
necessary, they should be implemented in the shell script itself (e.g. by forking
a subshell to kill its parent in some number of seconds).
- - -
* **fs.defaultFS** - the default path prefix used by the Hadoop FS client when none is given
Optionally, you may now configure the default path for Hadoop clients to use
the new HA-enabled logical URI. If you used "mycluster" as the nameservice ID
earlier, this will be the value of the authority portion of all of your HDFS
paths. This may be configured like so, in your **core-site.xml** file:
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
### Deployment details
After all of the necessary configuration options have been set, one must initially synchronize the two HA NameNodes' on-disk metadata.
* If you are setting up a fresh HDFS cluster, you should first run the format
command (*hdfs namenode -format*) on one of the NameNodes.
* If you have already formatted the NameNode, or are converting a
non-HA-enabled cluster to be HA-enabled, you should now copy over the
contents of your NameNode metadata directories to the other, unformatted
NameNode by running the command "*hdfs namenode -bootstrapStandby*" on the
unformatted NameNode. Running this command will also ensure that the shared
edits directory (as configured by **dfs.namenode.shared.edits.dir**) contains
sufficient edits transactions to be able to start both NameNodes.
* If you are converting a non-HA NameNode to be HA, you should run the
command "*hdfs -initializeSharedEdits*", which will initialize the shared
edits directory with the edits data from the local NameNode edits directories.
At this point you may start both of your HA NameNodes as you normally would start a NameNode.
You can visit each of the NameNodes' web pages separately by browsing to their configured HTTP addresses. You should notice that next to the configured address will be the HA state of the NameNode (either "standby" or "active".) Whenever an HA NameNode starts, it is initially in the Standby state.
### Administrative commands
Now that your HA NameNodes are configured and started, you will have access to some additional commands to administer your HA HDFS cluster. Specifically, you should familiarize yourself with all of the subcommands of the "*hdfs haadmin*" command. Running this command without any additional arguments will display the following usage information:
Usage: DFSHAAdmin [-ns <nameserviceId>]
[-transitionToActive <serviceId>]
[-transitionToStandby <serviceId>]
[-failover [--forcefence] [--forceactive] <serviceId> <serviceId>]
[-getServiceState <serviceId>]
[-checkHealth <serviceId>]
[-help <command>]
This guide describes high-level uses of each of these subcommands. For specific usage information of each subcommand, you should run "*hdfs haadmin -help \<command\>*".
* **transitionToActive** and **transitionToStandby** - transition the state of the given NameNode to Active or Standby
These subcommands cause a given NameNode to transition to the Active or Standby
state, respectively. **These commands do not attempt to perform any fencing,
and thus should rarely be used.** Instead, one should almost always prefer to
use the "*hdfs haadmin -failover*" subcommand.
* **failover** - initiate a failover between two NameNodes
This subcommand causes a failover from the first provided NameNode to the
second. If the first NameNode is in the Standby state, this command simply
transitions the second to the Active state without error. If the first NameNode
is in the Active state, an attempt will be made to gracefully transition it to
the Standby state. If this fails, the fencing methods (as configured by
**dfs.ha.fencing.methods**) will be attempted in order until one
succeeds. Only after this process will the second NameNode be transitioned to
the Active state. If no fencing method succeeds, the second NameNode will not
be transitioned to the Active state, and an error will be returned.
* **getServiceState** - determine whether the given NameNode is Active or Standby
Connect to the provided NameNode to determine its current state, printing
either "standby" or "active" to STDOUT appropriately. This subcommand might be
used by cron jobs or monitoring scripts which need to behave differently based
on whether the NameNode is currently Active or Standby (a sketch of such a check
follows after this list).
* **checkHealth** - check the health of the given NameNode
Connect to the provided NameNode to check its health. The NameNode is capable
of performing some diagnostics on itself, including checking if internal
services are running as expected. This command will return 0 if the NameNode
is healthy, non-zero otherwise. One might use this command for monitoring purposes.
**Note:** This is not yet implemented, and at present will always return
success, unless the given NameNode is completely down.
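As a rough sketch of how *getServiceState* might be driven from a monitoring script (the nameservice "mycluster" and NameNode ID "nn1" are carried over from the examples above, not required names):
#!/usr/bin/env bash
# Illustrative only: warn if nn1 does not currently report the Active state.
state=$(hdfs haadmin -ns mycluster -getServiceState nn1)
if [ "$state" != "active" ]; then
  echo "WARNING: NameNode nn1 reports state '$state'"
  exit 1
fi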
Automatic Failover
------------------
### Introduction
The above sections describe how to configure manual failover. In that mode, the system will not automatically trigger a failover from the active to the standby NameNode, even if the active node has failed. This section describes how to configure and deploy automatic failover.
### Components
Automatic failover adds two new components to an HDFS deployment: a ZooKeeper quorum, and the ZKFailoverController process (abbreviated as ZKFC).
Apache ZooKeeper is a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures. The implementation of automatic HDFS failover relies on ZooKeeper for the following things:
* **Failure detection** - each of the NameNode machines in the cluster
maintains a persistent session in ZooKeeper. If the machine crashes, the
ZooKeeper session will expire, notifying the other NameNode that a failover
should be triggered.
* **Active NameNode election** - ZooKeeper provides a simple mechanism to
exclusively elect a node as active. If the current active NameNode crashes,
another node may take a special exclusive lock in ZooKeeper indicating that
it should become the next active.
The ZKFailoverController (ZKFC) is a new component which is a ZooKeeper client which also monitors and manages the state of the NameNode. Each of the machines which runs a NameNode also runs a ZKFC, and that ZKFC is responsible for:
* **Health monitoring** - the ZKFC pings its local NameNode on a periodic
basis with a health-check command. So long as the NameNode responds in a
timely fashion with a healthy status, the ZKFC considers the node
healthy. If the node has crashed, frozen, or otherwise entered an unhealthy
state, the health monitor will mark it as unhealthy.
* **ZooKeeper session management** - when the local NameNode is healthy, the
ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it
also holds a special "lock" znode. This lock uses ZooKeeper's support for
"ephemeral" nodes; if the session expires, the lock node will be
automatically deleted.
* **ZooKeeper-based election** - if the local NameNode is healthy, and the
ZKFC sees that no other node currently holds the lock znode, it will itself
try to acquire the lock. If it succeeds, then it has "won the election", and
is responsible for running a failover to make its local NameNode active. The
failover process is similar to the manual failover described above: first,
the previous active is fenced if necessary, and then the local NameNode
transitions to active state.
For more details on the design of automatic failover, refer to the design document attached to HDFS-2185 on the Apache HDFS JIRA.
### Deploying ZooKeeper
In a typical deployment, ZooKeeper daemons are configured to run on three or five nodes. Since ZooKeeper itself has light resource requirements, it is acceptable to collocate the ZooKeeper nodes on the same hardware as the HDFS NameNode and Standby Node. Many operators choose to deploy the third ZooKeeper process on the same node as the YARN ResourceManager. It is advisable to configure the ZooKeeper nodes to store their data on separate disk drives from the HDFS metadata for best performance and isolation.
The setup of ZooKeeper is out of scope for this document. We will assume that you have set up a ZooKeeper cluster running on three or more nodes, and have verified its correct operation by connecting using the ZK CLI.
### Before you begin
Before you begin configuring automatic failover, you should shut down your cluster. It is not currently possible to transition from a manual failover setup to an automatic failover setup while the cluster is running.
### Configuring automatic failover
The configuration of automatic failover requires the addition of two new parameters to your configuration. In your `hdfs-site.xml` file, add:
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
This specifies that the cluster should be set up for automatic failover. In your `core-site.xml` file, add:
<property>
<name>ha.zookeeper.quorum</name>
<value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
This lists the host-port pairs running the ZooKeeper service.
As with the parameters described earlier in the document, these settings may be configured on a per-nameservice basis by suffixing the configuration key with the nameservice ID. For example, in a cluster with federation enabled, you can explicitly enable automatic failover for only one of the nameservices by setting `dfs.ha.automatic-failover.enabled.my-nameservice-id`.
There are also several other configuration parameters which may be set to control the behavior of automatic failover; however, they are not necessary for most installations. Please refer to the configuration key specific documentation for details.
### Initializing HA state in ZooKeeper
After the configuration keys have been added, the next step is to initialize required state in ZooKeeper. You can do so by running the following command from one of the NameNode hosts.
[hdfs]$ $HADOOP_PREFIX/bin/hdfs zkfc -formatZK
This will create a znode in ZooKeeper inside of which the automatic failover system stores its data.
### Starting the cluster with `start-dfs.sh`
Since automatic failover has been enabled in the configuration, the `start-dfs.sh` script will now automatically start a ZKFC daemon on any machine that runs a NameNode. When the ZKFCs start, they will automatically select one of the NameNodes to become active.
### Starting the cluster manually
If you manually manage the services on your cluster, you will need to manually start the `zkfc` daemon on each of the machines that runs a NameNode. You can start the daemon by running:
[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon start zkfc
### Securing access to ZooKeeper
If you are running a secure cluster, you will likely want to ensure that the information stored in ZooKeeper is also secured. This prevents malicious clients from modifying the metadata in ZooKeeper or potentially triggering a false failover.
In order to secure the information in ZooKeeper, first add the following to your `core-site.xml` file:
<property>
<name>ha.zookeeper.auth</name>
<value>@/path/to/zk-auth.txt</value>
</property>
<property>
<name>ha.zookeeper.acl</name>
<value>@/path/to/zk-acl.txt</value>
</property>
Please note the '@' character in these values -- this specifies that the configurations are not inline, but rather point to a file on disk.
The first configured file specifies a list of ZooKeeper authentications, in the same format as used by the ZK CLI. For example, you may specify something like:
digest:hdfs-zkfcs:mypassword
...where `hdfs-zkfcs` is a unique username for ZooKeeper, and `mypassword` is some unique string used as a password.
Next, generate a ZooKeeper ACL that corresponds to this authentication, using a command like the following:
[hdfs]$ java -cp $ZK_HOME/lib/*:$ZK_HOME/zookeeper-3.4.2.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider hdfs-zkfcs:mypassword
output: hdfs-zkfcs:mypassword->hdfs-zkfcs:P/OQvnYyU/nF/mGYvB/xurX8dYs=
Copy and paste the section of this output after the '-\>' string into the file `zk-acl.txt`, prefixed by the string "`digest:`". For example:
digest:hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=:rwcda
In order for these ACLs to take effect, you should then rerun the `zkfc -formatZK` command as described above.
After doing so, you may verify the ACLs from the ZK CLI as follows:
[zk: localhost:2181(CONNECTED) 1] getAcl /hadoop-ha
'digest,'hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=
: cdrwa
### Verifying automatic failover
Once automatic failover has been set up, you should test its operation. To do so, first locate the active NameNode. You can tell which node is active by visiting the NameNode web interfaces -- each node reports its HA state at the top of the page.
Once you have located your active NameNode, you may cause a failure on that node. For example, you can use `kill -9 <pid of NN>` to simulate a JVM crash. Or, you could power cycle the machine or unplug its network interface to simulate a different kind of outage. After triggering the outage you wish to test, the other NameNode should automatically become active within several seconds. The amount of time required to detect a failure and trigger a fail-over depends on the configuration of `ha.zookeeper.session-timeout.ms`, but defaults to 5 seconds.
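For example, a minimal sketch of simulating a crash on the active NameNode's host (the PID shown is purely hypothetical):
[hdfs]$ jps | grep NameNode
12345 NameNode
[hdfs]$ kill -9 12345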
If the test does not succeed, you may have a misconfiguration. Check the logs for the `zkfc` daemons as well as the NameNode daemons in order to further diagnose the issue.
Automatic Failover FAQ
----------------------
* **Is it important that I start the ZKFC and NameNode daemons in any particular order?**
No. On any given node you may start the ZKFC before or after its corresponding NameNode.
* **What additional monitoring should I put in place?**
You should add monitoring on each host that runs a NameNode to ensure that the
ZKFC remains running. In some types of ZooKeeper failures, for example, the
ZKFC may unexpectedly exit, and should be restarted to ensure that the system
is ready for automatic failover.
Additionally, you should monitor each of the servers in the ZooKeeper
quorum. If ZooKeeper crashes, then automatic failover will not function.
* **What happens if ZooKeeper goes down?**
If the ZooKeeper cluster crashes, no automatic failovers will be triggered.
However, HDFS will continue to run without any impact. When ZooKeeper is
restarted, HDFS will reconnect with no issues.
* **Can I designate one of my NameNodes as primary/preferred?**
No. Currently, this is not supported. Whichever NameNode is started first will
become active. You may choose to start the cluster in a specific order such
that your preferred node starts first.
* **How can I initiate a manual failover when automatic failover is configured?**
Even if automatic failover is configured, you may initiate a manual failover
using the same `hdfs haadmin` command. It will perform a coordinated
failover.
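For example, assuming the NameNode IDs "nn1" and "nn2" used earlier in this guide, a manual, coordinated failover from nn1 to nn2 would look like:
[hdfs]$ hdfs haadmin -failover nn1 nn2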
BookKeeper as a Shared storage (EXPERIMENTAL)
---------------------------------------------
One option for shared storage for the NameNode is BookKeeper. BookKeeper achieves high availability and strong durability guarantees by replicating edit log entries across multiple storage nodes. The edit log can be striped across the storage nodes for high performance. Fencing is supported in the protocol, i.e., BookKeeper will not allow two writers to write to a single edit log.
The metadata for BookKeeper is stored in ZooKeeper. In the current HA architecture, a ZooKeeper cluster is required for ZKFC. The same cluster can be used for BookKeeper metadata.
For more details on building a BookKeeper cluster, please refer to the [BookKeeper documentation](http://zookeeper.apache.org/bookkeeper/docs/trunk/bookkeeperConfig.html).
The BookKeeperJournalManager is an implementation of the HDFS JournalManager interface, which allows custom write ahead logging implementations to be plugged into the HDFS NameNode.
* **BookKeeper Journal Manager**
To use BookKeeperJournalManager, add the following to hdfs-site.xml.
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>bookkeeper://zk1:2181;zk2:2181;zk3:2181/hdfsjournal</value>
</property>
<property>
<name>dfs.namenode.edits.journal-plugin.bookkeeper</name>
<value>org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager</value>
</property>
The URI format for bookkeeper is `bookkeeper://[zkEnsemble]/[rootZnode]`, where
`[zkEnsemble]` is a list of semicolon-separated ZooKeeper host:port
pairs. In the example above there are 3 servers in the ensemble,
zk1, zk2 & zk3, each one listening on port 2181.
`[rootZnode]` is the path of the ZooKeeper znode under which the edit log
information will be stored.
The class specified for the journal-plugin must be available in the NameNode's
classpath. We explain how to generate a jar file with the journal manager and
its dependencies, and how to put it into the classpath below.
* **More configuration options**
* **dfs.namenode.bookkeeperjournal.output-buffer-size** -
Number of bytes a bookkeeper journal stream will buffer before
forcing a flush. Default is 1024.
<property>
<name>dfs.namenode.bookkeeperjournal.output-buffer-size</name>
<value>1024</value>
</property>
* **dfs.namenode.bookkeeperjournal.ensemble-size** -
Number of bookkeeper servers in edit log ensembles. This
is the number of bookkeeper servers which need to be available
for the edit log to be writable. Default is 3.
<property>
<name>dfs.namenode.bookkeeperjournal.ensemble-size</name>
<value>3</value>
</property>
* **dfs.namenode.bookkeeperjournal.quorum-size** -
Number of bookkeeper servers in the write quorum. This is the
number of bookkeeper servers which must have acknowledged the
write of an entry before it is considered written. Default is 2.
<property>
<name>dfs.namenode.bookkeeperjournal.quorum-size</name>
<value>2</value>
</property>
* **dfs.namenode.bookkeeperjournal.digestPw** -
Password to use when creating edit log segments.
<property>
<name>dfs.namenode.bookkeeperjournal.digestPw</name>
<value>myPassword</value>
</property>
* **dfs.namenode.bookkeeperjournal.zk.session.timeout** -
Session timeout for the ZooKeeper client used by the BookKeeper Journal Manager.
It is recommended that this value be less than the ZKFC session timeout.
The default value is 3000.
<property>
<name>dfs.namenode.bookkeeperjournal.zk.session.timeout</name>
<value>3000</value>
</property>
* **Building BookKeeper Journal Manager plugin jar**
To generate the distribution packages for BK journal, do the following.
$ mvn clean package -Pdist
This will generate a jar containing the BookKeeperJournalManager at
hadoop-hdfs/src/contrib/bkjournal/target/hadoop-hdfs-bkjournal-*VERSION*.jar.
Note that the -Pdist part of the build command is important, as it copies the
dependent bookkeeper-server jar under
hadoop-hdfs/src/contrib/bkjournal/target/lib.
* **Putting the BookKeeperJournalManager in the NameNode classpath**
To run an HDFS NameNode using BookKeeper as a backend, copy the bkjournal and
bookkeeper-server jars mentioned above into the lib directory of HDFS. In the
standard distribution of HDFS, this is at $HADOOP\_HDFS\_HOME/share/hadoop/hdfs/lib/
cp hadoop-hdfs/src/contrib/bkjournal/target/hadoop-hdfs-bkjournal-*VERSION*.jar $HADOOP\_HDFS\_HOME/share/hadoop/hdfs/lib/
* **Current limitations**
1) Security in BookKeeper. BookKeeper supports neither SASL nor SSL for
connections between the NameNode and BookKeeper storage nodes.
@ -0,0 +1,642 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
HDFS High Availability Using the Quorum Journal Manager
=======================================================
* [HDFS High Availability Using the Quorum Journal Manager](#HDFS_High_Availability_Using_the_Quorum_Journal_Manager)
* [Purpose](#Purpose)
* [Note: Using the Quorum Journal Manager or Conventional Shared Storage](#Note:_Using_the_Quorum_Journal_Manager_or_Conventional_Shared_Storage)
* [Background](#Background)
* [Architecture](#Architecture)
* [Hardware resources](#Hardware_resources)
* [Deployment](#Deployment)
* [Configuration overview](#Configuration_overview)
* [Configuration details](#Configuration_details)
* [Deployment details](#Deployment_details)
* [Administrative commands](#Administrative_commands)
* [Automatic Failover](#Automatic_Failover)
* [Introduction](#Introduction)
* [Components](#Components)
* [Deploying ZooKeeper](#Deploying_ZooKeeper)
* [Before you begin](#Before_you_begin)
* [Configuring automatic failover](#Configuring_automatic_failover)
* [Initializing HA state in ZooKeeper](#Initializing_HA_state_in_ZooKeeper)
* [Starting the cluster with start-dfs.sh](#Starting_the_cluster_with_start-dfs.sh)
* [Starting the cluster manually](#Starting_the_cluster_manually)
* [Securing access to ZooKeeper](#Securing_access_to_ZooKeeper)
* [Verifying automatic failover](#Verifying_automatic_failover)
* [Automatic Failover FAQ](#Automatic_Failover_FAQ)
* [HDFS Upgrade/Finalization/Rollback with HA Enabled](#HDFS_UpgradeFinalizationRollback_with_HA_Enabled)
Purpose
-------
This guide provides an overview of the HDFS High Availability (HA) feature and how to configure and manage an HA HDFS cluster, using the Quorum Journal Manager (QJM) feature.
This document assumes that the reader has a general understanding of the components and node types in an HDFS cluster. Please refer to the HDFS Architecture guide for details.
Note: Using the Quorum Journal Manager or Conventional Shared Storage
---------------------------------------------------------------------
This guide discusses how to configure and use HDFS HA using the Quorum Journal Manager (QJM) to share edit logs between the Active and Standby NameNodes. For information on how to configure HDFS HA using NFS for shared storage instead of the QJM, please see [this alternative guide.](./HDFSHighAvailabilityWithNFS.html)
Background
----------
Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.
This impacted the total availability of the HDFS cluster in two major ways:
* In the case of an unplanned event such as a machine crash, the cluster would
be unavailable until an operator restarted the NameNode.
* Planned maintenance events such as software or hardware upgrades on the
NameNode machine would result in windows of cluster downtime.
The HDFS High Availability feature addresses the above problems by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.
Architecture
------------
In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an *Active* state, and the other is in a *Standby* state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.
In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called "JournalNodes" (JNs). When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. The Standby node is capable of reading the edits from the JNs, and is constantly watching them for changes to the edit log. As the Standby node sees the edits, it applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.
In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.
It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called "split-brain scenario," the JournalNodes will only ever allow a single NameNode to be a writer at a time. During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state, allowing the new Active to safely proceed with failover.
Hardware resources
------------------
In order to deploy an HA cluster, you should prepare the following:
* **NameNode machines** - the machines on which you run the Active and
Standby NameNodes should have equivalent hardware to each other, and
equivalent hardware to what would be used in a non-HA cluster.
* **JournalNode machines** - the machines on which you run the JournalNodes.
The JournalNode daemon is relatively lightweight, so these daemons may
reasonably be collocated on machines with other Hadoop daemons, for example
NameNodes, the JobTracker, or the YARN ResourceManager. **Note:** There
must be at least 3 JournalNode daemons, since edit log modifications must be
written to a majority of JNs. This will allow the system to tolerate the
failure of a single machine. You may also run more than 3 JournalNodes, but
in order to actually increase the number of failures the system can tolerate,
you should run an odd number of JNs (i.e. 3, 5, 7, etc.). Note that when
running with N JournalNodes, the system can tolerate at most (N - 1) / 2
failures and continue to function normally.
Note that, in an HA cluster, the Standby NameNode also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to be HA-enabled to reuse the hardware which they had previously dedicated to the Secondary NameNode.
Deployment
----------
### Configuration overview
Similar to Federation configuration, HA configuration is backward compatible and allows existing single NameNode configurations to work without change. The new configuration is designed such that all the nodes in the cluster may have the same configuration without the need for deploying different configuration files to different machines based on the type of the node.
Like HDFS Federation, HA clusters reuse the `nameservice ID` to identify a single HDFS instance that may in fact consist of multiple HA NameNodes. In addition, a new abstraction called `NameNode ID` is added with HA. Each distinct NameNode in the cluster has a different NameNode ID to distinguish it. To support a single configuration file for all of the NameNodes, the relevant configuration parameters are suffixed with the **nameservice ID** as well as the **NameNode ID**.
### Configuration details
To configure HA NameNodes, you must add several configuration options to your **hdfs-site.xml** configuration file.
The order in which you set these configurations is unimportant, but the values you choose for **dfs.nameservices** and **dfs.ha.namenodes.[nameservice ID]** will determine the keys of those that follow. Thus, you should decide on these values before setting the rest of the configuration options.
* **dfs.nameservices** - the logical name for this new nameservice
Choose a logical name for this nameservice, for example "mycluster", and use
this logical name for the value of this config option. The name you choose is
arbitrary. It will be used both for configuration and as the authority
component of absolute HDFS paths in the cluster.
**Note:** If you are also using HDFS Federation, this configuration setting
should also include the list of other nameservices, HA or otherwise, as a
comma-separated list.
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
* **dfs.ha.namenodes.[nameservice ID]** - unique identifiers for each NameNode in the nameservice
Configure with a list of comma-separated NameNode IDs. This will be used by
DataNodes to determine all the NameNodes in the cluster. For example, if you
used "mycluster" as the nameservice ID previously, and you wanted to use "nn1"
and "nn2" as the individual IDs of the NameNodes, you would configure this as
such:
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
**Note:** Currently, only a maximum of two NameNodes may be configured per nameservice.
* **dfs.namenode.rpc-address.[nameservice ID].[name node ID]** - the fully-qualified RPC address for each NameNode to listen on
For both of the previously-configured NameNode IDs, set the full address and
IPC port of the NameNode process. Note that this results in two separate
configuration options. For example:
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>machine1.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>machine2.example.com:8020</value>
</property>
**Note:** You may similarly configure the "**servicerpc-address**" setting if you so desire.
* **dfs.namenode.http-address.[nameservice ID].[name node ID]** - the fully-qualified HTTP address for each NameNode to listen on
Similarly to *rpc-address* above, set the addresses for both NameNodes' HTTP
servers to listen on. For example:
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>machine1.example.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>machine2.example.com:50070</value>
</property>
**Note:** If you have Hadoop's security features enabled, you should also set
the *https-address* similarly for each NameNode.
* **dfs.namenode.shared.edits.dir** - the URI which identifies the group of JNs where the NameNodes will write/read edits
This is where one configures the addresses of the JournalNodes which provide
the shared edits storage, written to by the Active NameNode and read by the
Standby NameNode to stay up-to-date with all the file system changes the Active
NameNode makes. Though you must specify several JournalNode addresses,
**you should only configure one of these URIs.** The URI should be of the form:
`qjournal://host1:port1;host2:port2;host3:port3/journalId`. The Journal
ID is a unique identifier for this nameservice, which allows a single set of
JournalNodes to provide storage for multiple federated namesystems. Though not
a requirement, it's a good idea to reuse the nameservice ID for the journal
identifier.
For example, if the JournalNodes for this cluster were running on the
machines "node1.example.com", "node2.example.com", and "node3.example.com" and
the nameservice ID were "mycluster", you would use the following as the value
for this setting (the default port for the JournalNode is 8485):
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
</property>
* **dfs.client.failover.proxy.provider.[nameservice ID]** - the Java class that HDFS clients use to contact the Active NameNode
Configure the name of the Java class which will be used by the DFS Client to
determine which NameNode is the current Active, and therefore which NameNode is
currently serving client requests. The only implementation which currently
ships with Hadoop is the **ConfiguredFailoverProxyProvider**, so use this
unless you are using a custom one. For example:
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
* **dfs.ha.fencing.methods** - a list of scripts or Java classes which will be used to fence the Active NameNode during a failover
It is desirable for correctness of the system that only one NameNode be in
the Active state at any given time. **Importantly, when using the Quorum
Journal Manager, only one NameNode will ever be allowed to write to the
JournalNodes, so there is no potential for corrupting the file system metadata
from a split-brain scenario.** However, when a failover occurs, it is still
possible that the previous Active NameNode could serve read requests to
clients, which may be out of date until that NameNode shuts down when trying to
write to the JournalNodes. For this reason, it is still desirable to configure
some fencing methods even when using the Quorum Journal Manager. However, to
improve the availability of the system in the event the fencing mechanisms
fail, it is advisable to configure a fencing method which is guaranteed to
return success as the last fencing method in the list. Note that if you choose
to use no actual fencing methods, you still must configure something for this
setting, for example "`shell(/bin/true)`".
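For illustration, a fencing configuration that tries *sshfence* first and falls back to a method that always succeeds could look like the following (the methods are listed one per line, as described below):
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence
shell(/bin/true)</value>
</property>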
The fencing methods used during a failover are configured as a
carriage-return-separated list, which will be attempted in order until
one indicates that fencing has succeeded. There are two methods which ship with
Hadoop: *shell* and *sshfence*. For information on implementing your own custom
fencing method, see the *org.apache.hadoop.ha.NodeFencer* class.
- - -
**sshfence** - SSH to the Active NameNode and kill the process
The *sshfence* option SSHes to the target node and uses *fuser* to kill the
process listening on the service's TCP port. In order for this fencing option
to work, it must be able to SSH to the target node without providing a
passphrase. Thus, one must also configure the
**dfs.ha.fencing.ssh.private-key-files** option, which is a
comma-separated list of SSH private key files. For example:
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/exampleuser/.ssh/id_rsa</value>
</property>
Optionally, one may configure a non-standard username or port to perform the
SSH. One may also configure a timeout, in milliseconds, for the SSH, after
which this fencing method will be considered to have failed. It may be
configured like so:
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence([[username][:port]])</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>
- - -
**shell** - run an arbitrary shell command to fence the Active NameNode
The *shell* fencing method runs an arbitrary shell command. It may be
configured like so:
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
</property>
The string between '(' and ')' is passed directly to a bash shell and may not
include any closing parentheses.
The shell command will be run with an environment set up to contain all of the
current Hadoop configuration variables, with the '\_' character replacing any
'.' characters in the configuration keys. The configuration used has already had
any namenode-specific configurations promoted to their generic forms -- for example
**dfs\_namenode\_rpc-address** will contain the RPC address of the target node, even
though the configuration may specify that variable as
**dfs.namenode.rpc-address.ns1.nn1**.
Additionally, the following variables referring to the target node to be fenced
are also available:
| | |
|:---- |:---- |
| $target\_host | hostname of the node to be fenced |
| $target\_port | IPC port of the node to be fenced |
| $target\_address | the above two, combined as host:port |
| $target\_nameserviceid | the nameservice ID of the NN to be fenced |
| $target\_namenodeid | the namenode ID of the NN to be fenced |
These environment variables may also be used as substitutions in the shell
command itself. For example:
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/path/to/my/script.sh --nameservice=$target_nameserviceid $target_host:$target_port)</value>
</property>
If the shell command returns an exit
code of 0, the fencing is determined to be successful. If it returns any other
exit code, the fencing was not successful and the next fencing method in the
list will be attempted.
**Note:** This fencing method does not implement any timeout. If timeouts are
necessary, they should be implemented in the shell script itself (e.g. by forking
a subshell to kill its parent in some number of seconds).
- - -
* **fs.defaultFS** - the default path prefix used by the Hadoop FS client when none is given
Optionally, you may now configure the default path for Hadoop clients to use
the new HA-enabled logical URI. If you used "mycluster" as the nameservice ID
earlier, this will be the value of the authority portion of all of your HDFS
paths. This may be configured like so, in your **core-site.xml** file:
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
* **dfs.journalnode.edits.dir** - the path where the JournalNode daemon will store its local state
This is the absolute path on the JournalNode machines where the edits and
other local state used by the JNs will be stored. You may only use a single
path for this configuration. Redundancy for this data is provided by running
multiple separate JournalNodes, or by configuring this directory on a
locally-attached RAID array. For example:
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/path/to/journal/node/local/data</value>
</property>
### Deployment details
After all of the necessary configuration options have been set, you must start the JournalNode daemons on the set of machines where they will run. This can be done by running the command "*hadoop-daemon.sh start journalnode*" and waiting for the daemon to start on each of the relevant machines.
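For example, on each JournalNode machine (the sbin path below is an assumption based on the standard distribution layout):
[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh start journalnode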
Once the JournalNodes have been started, one must initially synchronize the two HA NameNodes' on-disk metadata.
* If you are setting up a fresh HDFS cluster, you should first run the format
command (*hdfs namenode -format*) on one of the NameNodes.
* If you have already formatted the NameNode, or are converting a
non-HA-enabled cluster to be HA-enabled, you should now copy over the
contents of your NameNode metadata directories to the other, unformatted
NameNode by running the command "*hdfs namenode -bootstrapStandby*" on the
unformatted NameNode. Running this command will also ensure that the
JournalNodes (as configured by **dfs.namenode.shared.edits.dir**) contain
sufficient edits transactions to be able to start both NameNodes.
* If you are converting a non-HA NameNode to be HA, you should run the
command "*hdfs -initializeSharedEdits*", which will initialize the
JournalNodes with the edits data from the local NameNode edits directories.
At this point you may start both of your HA NameNodes as you normally would start a NameNode.
You can visit each of the NameNodes' web pages separately by browsing to their configured HTTP addresses. You should notice that next to the configured address will be the HA state of the NameNode (either "standby" or "active".) Whenever an HA NameNode starts, it is initially in the Standby state.
### Administrative commands
Now that your HA NameNodes are configured and started, you will have access to some additional commands to administer your HA HDFS cluster. Specifically, you should familiarize yourself with all of the subcommands of the "*hdfs haadmin*" command. Running this command without any additional arguments will display the following usage information:
Usage: DFSHAAdmin [-ns <nameserviceId>]
[-transitionToActive <serviceId>]
[-transitionToStandby <serviceId>]
[-failover [--forcefence] [--forceactive] <serviceId> <serviceId>]
[-getServiceState <serviceId>]
[-checkHealth <serviceId>]
[-help <command>]
This guide describes high-level uses of each of these subcommands. For specific usage information of each subcommand, you should run "*hdfs haadmin -help \<command\>*".
* **transitionToActive** and **transitionToStandby** - transition the state of the given NameNode to Active or Standby
These subcommands cause a given NameNode to transition to the Active or Standby
state, respectively. **These commands do not attempt to perform any fencing,
and thus should rarely be used.** Instead, one should almost always prefer to
use the "*hdfs haadmin -failover*" subcommand.
* **failover** - initiate a failover between two NameNodes
This subcommand causes a failover from the first provided NameNode to the
second. If the first NameNode is in the Standby state, this command simply
transitions the second to the Active state without error. If the first NameNode
is in the Active state, an attempt will be made to gracefully transition it to
the Standby state. If this fails, the fencing methods (as configured by
**dfs.ha.fencing.methods**) will be attempted in order until one
succeeds. Only after this process will the second NameNode be transitioned to
the Active state. If no fencing method succeeds, the second NameNode will not
be transitioned to the Active state, and an error will be returned.
* **getServiceState** - determine whether the given NameNode is Active or Standby
Connect to the provided NameNode to determine its current state, printing
either "standby" or "active" to STDOUT appropriately. This subcommand might be
used by cron jobs or monitoring scripts which need to behave differently based
on whether the NameNode is currently Active or Standby.
* **checkHealth** - check the health of the given NameNode
Connect to the provided NameNode to check its health. The NameNode is capable
of performing some diagnostics on itself, including checking if internal
services are running as expected. This command will return 0 if the NameNode is
healthy, non-zero otherwise. One might use this command for monitoring purposes.
**Note:** This is not yet implemented, and at present will always return
success, unless the given NameNode is completely down.
Automatic Failover
------------------
### Introduction
The above sections describe how to configure manual failover. In that mode, the system will not automatically trigger a failover from the active to the standby NameNode, even if the active node has failed. This section describes how to configure and deploy automatic failover.
### Components
Automatic failover adds two new components to an HDFS deployment: a ZooKeeper quorum, and the ZKFailoverController process (abbreviated as ZKFC).
Apache ZooKeeper is a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures. The implementation of automatic HDFS failover relies on ZooKeeper for the following things:
* **Failure detection** - each of the NameNode machines in the cluster
maintains a persistent session in ZooKeeper. If the machine crashes, the
ZooKeeper session will expire, notifying the other NameNode that a failover
should be triggered.
* **Active NameNode election** - ZooKeeper provides a simple mechanism to
exclusively elect a node as active. If the current active NameNode crashes,
another node may take a special exclusive lock in ZooKeeper indicating that
it should become the next active.
The ZKFailoverController (ZKFC) is a new component which is a ZooKeeper client which also monitors and manages the state of the NameNode. Each of the machines which runs a NameNode also runs a ZKFC, and that ZKFC is responsible for:
* **Health monitoring** - the ZKFC pings its local NameNode on a periodic
basis with a health-check command. So long as the NameNode responds in a
timely fashion with a healthy status, the ZKFC considers the node
healthy. If the node has crashed, frozen, or otherwise entered an unhealthy
state, the health monitor will mark it as unhealthy.
* **ZooKeeper session management** - when the local NameNode is healthy, the
ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it
also holds a special "lock" znode. This lock uses ZooKeeper's support for
"ephemeral" nodes; if the session expires, the lock node will be
automatically deleted.
* **ZooKeeper-based election** - if the local NameNode is healthy, and the
ZKFC sees that no other node currently holds the lock znode, it will itself
try to acquire the lock. If it succeeds, then it has "won the election", and
is responsible for running a failover to make its local NameNode active. The
failover process is similar to the manual failover described above: first,
the previous active is fenced if necessary, and then the local NameNode
transitions to active state.
For more details on the design of automatic failover, refer to the design document attached to HDFS-2185 on the Apache HDFS JIRA.
### Deploying ZooKeeper
In a typical deployment, ZooKeeper daemons are configured to run on three or five nodes. Since ZooKeeper itself has light resource requirements, it is acceptable to collocate the ZooKeeper nodes on the same hardware as the HDFS NameNode and Standby Node. Many operators choose to deploy the third ZooKeeper process on the same node as the YARN ResourceManager. It is advisable to configure the ZooKeeper nodes to store their data on separate disk drives from the HDFS metadata for best performance and isolation.
The setup of ZooKeeper is out of scope for this document. We will assume that you have set up a ZooKeeper cluster running on three or more nodes, and have verified its correct operation by connecting using the ZK CLI.
### Before you begin
Before you begin configuring automatic failover, you should shut down your cluster. It is not currently possible to transition from a manual failover setup to an automatic failover setup while the cluster is running.
### Configuring automatic failover
The configuration of automatic failover requires the addition of two new parameters to your configuration. In your `hdfs-site.xml` file, add:
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
This specifies that the cluster should be set up for automatic failover. In your `core-site.xml` file, add:
<property>
<name>ha.zookeeper.quorum</name>
<value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
This lists the host-port pairs running the ZooKeeper service.
As with the parameters described earlier in the document, these settings may be configured on a per-nameservice basis by suffixing the configuration key with the nameservice ID. For example, in a cluster with federation enabled, you can explicitly enable automatic failover for only one of the nameservices by setting `dfs.ha.automatic-failover.enabled.my-nameservice-id`.
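For illustration, such a per-nameservice override might look like the following in hdfs-site.xml (the nameservice ID here is just the placeholder from the sentence above):
<property>
<name>dfs.ha.automatic-failover.enabled.my-nameservice-id</name>
<value>true</value>
</property>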
There are also several other configuration parameters which may be set to control the behavior of automatic failover; however, they are not necessary for most installations. Please refer to the configuration key specific documentation for details.
### Initializing HA state in ZooKeeper
After the configuration keys have been added, the next step is to initialize required state in ZooKeeper. You can do so by running the following command from one of the NameNode hosts.
[hdfs]$ $HADOOP_PREFIX/bin/hdfs zkfc -formatZK
This will create a znode in ZooKeeper inside of which the automatic failover system stores its data.
### Starting the cluster with `start-dfs.sh`
Since automatic failover has been enabled in the configuration, the `start-dfs.sh` script will now automatically start a ZKFC daemon on any machine that runs a NameNode. When the ZKFCs start, they will automatically select one of the NameNodes to become active.
### Starting the cluster manually
If you manually manage the services on your cluster, you will need to manually start the `zkfc` daemon on each of the machines that runs a NameNode. You can start the daemon by running:
[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon start zkfc
### Securing access to ZooKeeper
If you are running a secure cluster, you will likely want to ensure that the information stored in ZooKeeper is also secured. This prevents malicious clients from modifying the metadata in ZooKeeper or potentially triggering a false failover.
In order to secure the information in ZooKeeper, first add the following to your `core-site.xml` file:
<property>
<name>ha.zookeeper.auth</name>
<value>@/path/to/zk-auth.txt</value>
</property>
<property>
<name>ha.zookeeper.acl</name>
<value>@/path/to/zk-acl.txt</value>
</property>
Please note the '@' character in these values -- this specifies that the configurations are not inline, but rather point to a file on disk.
The first configured file specifies a list of ZooKeeper authentications, in the same format as used by the ZK CLI. For example, you may specify something like:
digest:hdfs-zkfcs:mypassword
...where `hdfs-zkfcs` is a unique username for ZooKeeper, and `mypassword` is some unique string used as a password.
Next, generate a ZooKeeper ACL that corresponds to this authentication, using a command like the following:
[hdfs]$ java -cp $ZK_HOME/lib/*:$ZK_HOME/zookeeper-3.4.2.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider hdfs-zkfcs:mypassword
output: hdfs-zkfcs:mypassword->hdfs-zkfcs:P/OQvnYyU/nF/mGYvB/xurX8dYs=
Copy and paste the section of this output after the '-\>' string into the file `zk-acl.txt`, prefixed by the string "`digest:`". For example:
digest:hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=:rwcda
In order for these ACLs to take effect, you should then rerun the `zkfc -formatZK` command as described above.
After doing so, you may verify the ACLs from the ZK CLI as follows:
[zk: localhost:2181(CONNECTED) 1] getAcl /hadoop-ha
'digest,'hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=
: cdrwa
### Verifying automatic failover
Once automatic failover has been set up, you should test its operation. To do so, first locate the active NameNode. You can tell which node is active by visiting the NameNode web interfaces -- each node reports its HA state at the top of the page.
Once you have located your active NameNode, you may cause a failure on that node. For example, you can use `kill -9 <pid of NN>` to simulate a JVM crash. Or, you could power cycle the machine or unplug its network interface to simulate a different kind of outage. After triggering the outage you wish to test, the other NameNode should automatically become active within several seconds. The amount of time required to detect a failure and trigger a fail-over depends on the configuration of `ha.zookeeper.session-timeout.ms`, but defaults to 5 seconds.
If the test does not succeed, you may have a misconfiguration. Check the logs for the `zkfc` daemons as well as the NameNode daemons in order to further diagnose the issue.
Automatic Failover FAQ
----------------------
* **Is it important that I start the ZKFC and NameNode daemons in any particular order?**
No. On any given node you may start the ZKFC before or after its corresponding NameNode.
* **What additional monitoring should I put in place?**
You should add monitoring on each host that runs a NameNode to ensure that the
ZKFC remains running. In some types of ZooKeeper failures, for example, the
ZKFC may unexpectedly exit, and should be restarted to ensure that the system
is ready for automatic failover.
Additionally, you should monitor each of the servers in the ZooKeeper
quorum. If ZooKeeper crashes, then automatic failover will not function.
* **What happens if ZooKeeper goes down?**
If the ZooKeeper cluster crashes, no automatic failovers will be triggered.
However, HDFS will continue to run without any impact. When ZooKeeper is
restarted, HDFS will reconnect with no issues.
* **Can I designate one of my NameNodes as primary/preferred?**
No. Currently, this is not supported. Whichever NameNode is started first will
become active. You may choose to start the cluster in a specific order such
that your preferred node starts first.
* **How can I initiate a manual failover when automatic failover is configured?**
Even if automatic failover is configured, you may initiate a manual failover
using the same `hdfs haadmin` command. It will perform a coordinated
failover.
HDFS Upgrade/Finalization/Rollback with HA Enabled
--------------------------------------------------
When moving between versions of HDFS, sometimes the newer software can simply be installed and the cluster restarted. Sometimes, however, upgrading the version of HDFS you're running may require changing on-disk data. In this case, one must use the HDFS Upgrade/Finalize/Rollback facility after installing the new software. This process is made more complex in an HA environment, since the on-disk metadata that the NN relies upon is by definition distributed, both on the two HA NNs in the pair, and on the JournalNodes in the case that QJM is being used for the shared edits storage. This documentation section describes the procedure to use the HDFS Upgrade/Finalize/Rollback facility in an HA setup.
**To perform an HA upgrade**, the operator must do the following:
1. Shut down all of the NNs as normal, and install the newer software.
2. Start up all of the JNs. Note that it is **critical** that all the
JNs be running when performing the upgrade, rollback, or finalization
operations. If any of the JNs are down at the time of running any of these
operations, the operation will fail.
3. Start one of the NNs with the `-upgrade` flag.
4. On start, this NN will not enter the standby state as usual in an HA
setup. Rather, this NN will immediately enter the active state, perform an
upgrade of its local storage dirs, and also perform an upgrade of the shared
edit log.
5. At this point the other NN in the HA pair will be out of sync with
the upgraded NN. In order to bring it back in sync and once again have a highly
available setup, you should re-bootstrap this NameNode by running the NN with
the `-bootstrapStandby` flag. It is an error to start this second NN with
the `-upgrade` flag. (See the example commands after this list.)
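The commands below are only a sketch of the steps above; the exact way NameNode daemons are started and stopped will vary with your deployment, so adapt them to your environment:

    # On the NN chosen to drive the upgrade (step 3):
    $ hdfs namenode -upgrade

    # Later, on the other NN, once the upgraded NN is active (step 5):
    $ hdfs namenode -bootstrapStandby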
Note that if at any time you want to restart the NameNodes before finalizing or rolling back the upgrade, you should start the NNs as normal, i.e. without any special startup flag.
**To finalize an HA upgrade**, the operator will use the `hdfs dfsadmin -finalizeUpgrade` command while the NNs are running and one of them is active. The active NN at the time this happens will perform the finalization of the shared log, and the NN whose local storage directories contain the previous FS state will delete its local state.
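For example:

    $ hdfs dfsadmin -finalizeUpgrade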
**To perform a rollback** of an upgrade, both NNs should first be shut down. The operator should run the roll back command on the NN where they initiated the upgrade procedure, which will perform the rollback on the local dirs there, as well as on the shared log, either NFS or on the JNs. Afterward, this NN should be started and the operator should run `-bootstrapStandby` on the other NN to bring the two NNs in sync with this rolled-back file system state.
@ -0,0 +1,240 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
HDFS Architecture
=================
* [HDFS Architecture](#HDFS_Architecture)
* [Introduction](#Introduction)
* [Assumptions and Goals](#Assumptions_and_Goals)
* [Hardware Failure](#Hardware_Failure)
* [Streaming Data Access](#Streaming_Data_Access)
* [Large Data Sets](#Large_Data_Sets)
* [Simple Coherency Model](#Simple_Coherency_Model)
* ["Moving Computation is Cheaper than Moving Data"](#aMoving_Computation_is_Cheaper_than_Moving_Data)
* [Portability Across Heterogeneous Hardware and Software Platforms](#Portability_Across_Heterogeneous_Hardware_and_Software_Platforms)
* [NameNode and DataNodes](#NameNode_and_DataNodes)
* [The File System Namespace](#The_File_System_Namespace)
* [Data Replication](#Data_Replication)
* [Replica Placement: The First Baby Steps](#Replica_Placement:_The_First_Baby_Steps)
* [Replica Selection](#Replica_Selection)
* [Safemode](#Safemode)
* [The Persistence of File System Metadata](#The_Persistence_of_File_System_Metadata)
* [The Communication Protocols](#The_Communication_Protocols)
* [Robustness](#Robustness)
* [Data Disk Failure, Heartbeats and Re-Replication](#Data_Disk_Failure_Heartbeats_and_Re-Replication)
* [Cluster Rebalancing](#Cluster_Rebalancing)
* [Data Integrity](#Data_Integrity)
* [Metadata Disk Failure](#Metadata_Disk_Failure)
* [Snapshots](#Snapshots)
* [Data Organization](#Data_Organization)
* [Data Blocks](#Data_Blocks)
* [Staging](#Staging)
* [Replication Pipelining](#Replication_Pipelining)
* [Accessibility](#Accessibility)
* [FS Shell](#FS_Shell)
* [DFSAdmin](#DFSAdmin)
* [Browser Interface](#Browser_Interface)
* [Space Reclamation](#Space_Reclamation)
* [File Deletes and Undeletes](#File_Deletes_and_Undeletes)
* [Decrease Replication Factor](#Decrease_Replication_Factor)
* [References](#References)
Introduction
------------
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project. The project URL is <http://hadoop.apache.org/>.
Assumptions and Goals
---------------------
### Hardware Failure
Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
### Streaming Data Access
Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates.
### Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
### Simple Coherency Model
HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A Map/Reduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.
### "Moving Computation is Cheaper than Moving Data"
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
### Portability Across Heterogeneous Hardware and Software Platforms
HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.
NameNode and DataNodes
----------------------
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
![HDFS Architecture](images/hdfsarchitecture.png)
The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.
The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.
The File System Namespace
-------------------------
HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. HDFS does not yet implement user quotas or access permissions. HDFS does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.
The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
Data Replication
----------------
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.
The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.
![HDFS DataNodes](images/hdfsdatanodes.png)
### Replica Placement: The First Baby Steps
The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation for the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies.
Large HDFS instances run on a cluster of computers that commonly spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.
The NameNode determines the rack id each DataNode belongs to via the process outlined in [Hadoop Rack Awareness](../hadoop-common/ClusterSetup.html#HadoopRackAwareness). A simple but non-optimal policy is to place replicas on unique racks. This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data. This policy evenly distributes replicas in the cluster which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks.
For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.
The current, default replica placement policy described here is a work in progress.
### Replica Selection
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If an HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.
### Safemode
On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The NameNode then replicates these blocks to other DataNodes.
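Administrators can inspect or override Safemode from the command line with the `hdfs dfsadmin -safemode` subcommand; the output shown here is illustrative:

    $ hdfs dfsadmin -safemode get
    Safe mode is OFF
    $ hdfs dfsadmin -safemode enter
    $ hdfs dfsadmin -safemode leave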
The Persistence of File System Metadata
---------------------------------------
The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode's local file system too.
The NameNode keeps an image of the entire file system namespace and file Blockmap in memory. This key metadata item is designed to be compact, such that a NameNode with 4 GB of RAM is plenty to support a huge number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes out this new version into a new FsImage on disk. It can then truncate the old EditLog because its transactions have been applied to the persistent FsImage. This process is called a checkpoint. In the current implementation, a checkpoint only occurs when the NameNode starts up. Work is in progress to support periodic checkpointing in the near future.
The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in the same directory because the local file system might not be able to efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files and sends this report to the NameNode: this is the Blockreport.
The Communication Protocols
---------------------------
All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a connection to a configurable TCP port on the NameNode machine. It talks the ClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. By design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by DataNodes or clients.
Robustness
----------
The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures and network partitions.
### Data Disk Failure, Heartbeats and Re-Replication
Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
### Cluster Rebalancing
The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.
### Data Integrity
It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.
### Metadata Disk Failure
The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use.
The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported.
### Snapshots
Snapshots support storing a copy of data at a particular instant of time. One usage of the snapshot feature may be to roll back a corrupted HDFS instance to a previously known good point in time. HDFS does not currently support snapshots but will in a future release.
Data Organization
-----------------
### Data Blocks
HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode.
### Staging
A client request to create a file does not reach the NameNode immediately. In fact, initially the HDFS client caches the file data into a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth over one HDFS block size, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy and allocates a data block for it. The NameNode responds to the client request with the identity of the DataNode and the destination data block. Then the client flushes the block of data from the local temporary file to the specified DataNode. When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode. The client then tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost.
The above approach has been adopted after careful consideration of target applications that run on HDFS. These applications need streaming writes to files. If a client writes to a remote file directly without any client side buffering, the network speed and the congestion in the network impacts throughput considerably. This approach is not without precedent. Earlier distributed file systems, e.g. AFS, have used client side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads.
### Replication Pipelining
When a client is writing data to an HDFS file, its data is first written to a local file as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode. This list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions, writes each portion to its local repository and transfers that portion to the second DataNode in the list. The second DataNode, in turn starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. Thus, the data is pipelined from one DataNode to the next.
Accessibility
-------------
HDFS can be accessed from applications in many different ways. Natively, HDFS provides a [FileSystem Java API](http://hadoop.apache.org/docs/current/api/) for applications to use. A C language wrapper for this Java API is also available. In addition, an HTTP browser can also be used to browse the files of an HDFS instance. Work is in progress to expose HDFS through the WebDAV protocol.
### FS Shell
HDFS allows user data to be organized in the form of files and directories. It provides a commandline interface called FS shell that lets a user interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with. Here are some sample action/command pairs:
| Action | Command |
|:---- |:---- |
| Create a directory named `/foodir` | `bin/hadoop dfs -mkdir /foodir` |
| Remove a directory named `/foodir` | `bin/hadoop fs -rm -R /foodir` |
| View the contents of a file named `/foodir/myfile.txt` | `bin/hadoop dfs -cat /foodir/myfile.txt` |
FS shell is targeted for applications that need a scripting language to interact with the stored data.
### DFSAdmin
The DFSAdmin command set is used for administering an HDFS cluster. These are commands that are used only by an HDFS administrator. Here are some sample action/command pairs:
| Action | Command |
|:---- |:---- |
| Put the cluster in Safemode | `bin/hdfs dfsadmin -safemode enter` |
| Generate a list of DataNodes | `bin/hdfs dfsadmin -report` |
| Recommission or decommission DataNode(s) | `bin/hdfs dfsadmin -refreshNodes` |
### Browser Interface
A typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser.
Space Reclamation
-----------------
### File Deletes and Undeletes
When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the `/trash` directory. The file can be restored quickly as long as it remains in `/trash`. A file remains in `/trash` for a configurable amount of time. After the expiry of its life in `/trash`, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.
A user can undelete a file after deleting it as long as it remains in the `/trash` directory. If a user wants to undelete a file that he/she has deleted, he/she can navigate the `/trash` directory and retrieve the file. The `/trash` directory contains only the latest copy of the file that was deleted. The `/trash` directory is just like any other directory with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default trash interval is 0 (files are deleted without being stored in trash). This value is a configurable parameter, `fs.trash.interval`, stored in core-site.xml.
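For example, a site that wants deleted files to remain recoverable for one day could set the interval (in minutes) in core-site.xml roughly as follows; the value below is only illustrative:

    <property>
      <name>fs.trash.interval</name>
      <value>1440</value>
      <description>Number of minutes to keep files in /trash; 0 disables trash.</description>
    </property>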
### Decrease Replication Factor
When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks and the corresponding free space appears in the cluster. Once again, there might be a time delay between the completion of the setReplication API call and the appearance of free space in the cluster.
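As an illustration, the replication factor of an existing file can be lowered from the FS shell; the path is the sample used earlier in this document and `-w` waits for the change to complete:

    bin/hdfs dfs -setrep -w 2 /foodir/myfile.txt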
References
----------
Hadoop [JavaDoc API](http://hadoop.apache.org/docs/current/api/).
HDFS source code: <http://hadoop.apache.org/version_control.html>
@ -0,0 +1,69 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Offline Edits Viewer Guide
==========================
* [Offline Edits Viewer Guide](#Offline_Edits_Viewer_Guide)
* [Overview](#Overview)
* [Usage](#Usage)
* [Case study: Hadoop cluster recovery](#Case_study:_Hadoop_cluster_recovery)
Overview
--------
Offline Edits Viewer is a tool to parse the Edits log file. The current processors are mostly useful for conversion between different formats, including XML which is human readable and easier to edit than native binary format.
The tool can parse the edits formats -18 (roughly Hadoop 0.19) and later. The tool operates on files only; it does not require a Hadoop cluster to be running.
Input formats supported:
1. **binary**: native binary format that Hadoop uses internally
2. **xml**: XML format, as produced by xml processor, used if filename
has `.xml` (case insensitive) extension
The Offline Edits Viewer provides several output processors (unless stated otherwise, the output of a processor can be converted back to the original edits file):
1. **binary**: native binary format that Hadoop uses internally
2. **xml**: XML format
3. **stats**: prints out statistics; this output cannot be converted back to
an edits file
Usage
-----
bash$ bin/hdfs oev -i edits -o edits.xml
| Flag | Description |
|:---- |:---- |
| [`-i` ; `--inputFile`] *input file* | Specify the input edits log file to process. An `.xml` (case insensitive) extension means XML format; otherwise binary format is assumed. Required. |
| [`-o` ; `--outputFile`] *output file* | Specify the output filename, if the specified output processor generates one. If the specified file already exists, it is silently overwritten. Required. |
| [`-p` ; `--processor`] *processor* | Specify the image processor to apply against the image file. Currently valid options are `binary`, `xml` (default) and `stats`. |
| [`-v` ; `--verbose`] | Print the input and output filenames and pipe output of processor to console as well as specified file. On extremely large files, this may increase processing time by an order of magnitude. |
| [`-h` ; `--help`] | Display the tool usage and help information and exit. |
Case study: Hadoop cluster recovery
-----------------------------------
If there is a problem with the Hadoop cluster and the edits file is corrupted, it is possible to save at least the part of the edits file that is correct. This can be done by converting the binary edits to XML, editing it manually, and then converting it back to binary. The most common problem is that the edits file is missing the closing record (the record that has opCode -1). This should be recognized by the tool, and the XML output should be properly closed.
If there is no closing record in the XML file, you can add one after the last correct record. Anything after the record with opCode -1 is ignored.
Example of a closing record (with opCode -1):
<RECORD>
<OPCODE>-1</OPCODE>
<DATA>
</DATA>
</RECORD>
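Putting the pieces together, a recovery session might look like the following sketch (file names are placeholders):

    bash$ bin/hdfs oev -i edits -o edits.xml
    # fix edits.xml by hand, e.g. add the missing closing record
    bash$ bin/hdfs oev -i edits.xml -o edits.repaired -p binary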
@ -0,0 +1,172 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Offline Image Viewer Guide
==========================
* [Offline Image Viewer Guide](#Offline_Image_Viewer_Guide)
* [Overview](#Overview)
* [Usage](#Usage)
* [Web Processor](#Web_Processor)
* [XML Processor](#XML_Processor)
* [Options](#Options)
* [Analyzing Results](#Analyzing_Results)
* [oiv\_legacy Command](#oiv_legacy_Command)
* [Usage](#Usage)
* [Options](#Options)
Overview
--------
The Offline Image Viewer is a tool to dump the contents of hdfs fsimage files to a human-readable format and provide a read-only WebHDFS API in order to allow offline analysis and examination of a Hadoop cluster's namespace. The tool is able to process very large image files relatively quickly. The tool handles the layout formats that were included with Hadoop versions 2.4 and up. If you want to handle older layout formats, you can use the Offline Image Viewer of Hadoop 2.3 or [oiv\_legacy Command](#oiv_legacy_Command). If the tool is not able to process an image file, it will exit cleanly. The Offline Image Viewer does not require a Hadoop cluster to be running; it is entirely offline in its operation.
The Offline Image Viewer provides several output processors:
1. Web is the default output processor. It launches an HTTP server
that exposes a read-only WebHDFS API. Users can investigate the namespace
interactively by using the HTTP REST API.
2. XML creates an XML document of the fsimage and includes all of the
information within the fsimage, similar to the lsr processor. The
output of this processor is amenable to automated processing and
analysis with XML tools. Due to the verbosity of the XML syntax,
this processor will also generate the largest amount of output.
3. FileDistribution is the tool for analyzing file sizes in the
namespace image. In order to run the tool one should define a range
of integers [0, maxSize] by specifying maxSize and a step. The
range of integers is divided into segments of size step: [0, s[1],
..., s[n-1], maxSize], and the processor calculates how many files
in the system fall into each segment [s[i-1], s[i]). Note that
files larger than maxSize always fall into the very last segment.
The output file is formatted as a tab-separated two-column table:
Size and NumFiles, where Size represents the start of the segment
and NumFiles is the number of files from the image whose size falls
into this segment.
Usage
-----
### Web Processor
The Web processor launches an HTTP server which exposes a read-only WebHDFS API. Users can specify the address to listen on via the `-addr` option (default: localhost:5978).
bash$ bin/hdfs oiv -i fsimage
14/04/07 13:25:14 INFO offlineImageViewer.WebImageViewer: WebImageViewer
started. Listening on /127.0.0.1:5978. Press Ctrl+C to stop the viewer.
Users can access the viewer and get the information of the fsimage by the following shell command:
bash$ bin/hdfs dfs -ls webhdfs://127.0.0.1:5978/
Found 2 items
drwxrwx--* - root supergroup 0 2014-03-26 20:16 webhdfs://127.0.0.1:5978/tmp
drwxr-xr-x - root supergroup 0 2014-03-31 14:08 webhdfs://127.0.0.1:5978/user
To get the information of all the files and directories, you can simply use the following command:
bash$ bin/hdfs dfs -ls -R webhdfs://127.0.0.1:5978/
Users can also get JSON formatted FileStatuses via HTTP REST API.
bash$ curl -i http://127.0.0.1:5978/webhdfs/v1/?op=liststatus
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 252
{"FileStatuses":{"FileStatus":[
{"fileId":16386,"accessTime":0,"replication":0,"owner":"theuser","length":0,"permission":"755","blockSize":0,"modificationTime":1392772497282,"type":"DIRECTORY","group":"supergroup","childrenNum":1,"pathSuffix":"user"}
]}}
The Web processor now supports the following operations:
* [LISTSTATUS](./WebHDFS.html#List_a_Directory)
* [GETFILESTATUS](./WebHDFS.html#Status_of_a_FileDirectory)
* [GETACLSTATUS](./WebHDFS.html#Get_ACL_Status)
### XML Processor
The XML Processor is used to dump all the contents of the fsimage. Users can specify the input and output files via the `-i` and `-o` command-line options.
bash$ bin/hdfs oiv -p XML -i fsimage -o fsimage.xml
This will create a file named fsimage.xml containing all the information in the fsimage. For very large image files, this process may take several minutes.
Applying the Offline Image Viewer with XML processor would result in the following output:
<?xml version="1.0"?>
<fsimage>
<NameSection>
<genstampV1>1000</genstampV1>
<genstampV2>1002</genstampV2>
<genstampV1Limit>0</genstampV1Limit>
<lastAllocatedBlockId>1073741826</lastAllocatedBlockId>
<txid>37</txid>
</NameSection>
<INodeSection>
<lastInodeId>16400</lastInodeId>
<inode>
<id>16385</id>
<type>DIRECTORY</type>
<name></name>
<mtime>1392772497282</mtime>
<permission>theuser:supergroup:rwxr-xr-x</permission>
<nsquota>9223372036854775807</nsquota>
<dsquota>-1</dsquota>
</inode>
...remaining output omitted...
Options
-------
| **Flag** | **Description** |
|:---- |:---- |
| `-i`\|`--inputFile` *input file* | Specify the input fsimage file to process. Required. |
| `-o`\|`--outputFile` *output file* | Specify the output filename, if the specified output processor generates one. If the specified file already exists, it is silently overwritten. (output to stdout by default) |
| `-p`\|`--processor` *processor* | Specify the image processor to apply against the image file. Currently valid options are Web (default), XML and FileDistribution. |
| `-addr` *address* | Specify the address(host:port) to listen. (localhost:5978 by default). This option is used with Web processor. |
| `-maxSize` *size* | Specify the range [0, maxSize] of file sizes to be analyzed in bytes (128GB by default). This option is used with FileDistribution processor. |
| `-step` *size* | Specify the granularity of the distribution in bytes (2MB by default). This option is used with FileDistribution processor. |
| `-h`\|`--help` | Display the tool usage and help information and exit. |
Analyzing Results
-----------------
The Offline Image Viewer makes it easy to gather large amounts of data about the hdfs namespace. This information can then be used to explore file system usage patterns or find specific files that match arbitrary criteria, along with other types of namespace analysis.
oiv\_legacy Command
-------------------
Due to the internal layout changes introduced by the ProtocolBuffer-based fsimage ([HDFS-5698](https://issues.apache.org/jira/browse/HDFS-5698)), OfflineImageViewer consumes an excessive amount of memory and loses some functions such as the Indented and Delimited processors. If you want to process images without a large amount of memory or use these processors, you can use the `oiv_legacy` command (same as `oiv` in Hadoop 2.3).
### Usage
1. Set `dfs.namenode.legacy-oiv-image.dir` to an appropriate directory
to make standby NameNode or SecondaryNameNode save its namespace in the
old fsimage format during checkpointing.
2. Use the `oiv_legacy` command to process the old format fsimage.
bash$ bin/hdfs oiv_legacy -i fsimage_old -o output
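A minimal hdfs-site.xml snippet for step 1 might look like the following; the directory path is only an example:

    <property>
      <name>dfs.namenode.legacy-oiv-image.dir</name>
      <value>/tmp/legacy-oiv-images</value>
    </property>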
### Options
| **Flag** | **Description** |
|:---- |:---- |
| `-i`\|`--inputFile` *input file* | Specify the input fsimage file to process. Required. |
| `-o`\|`--outputFile` *output file* | Specify the output filename, if the specified output processor generates one. If the specified file already exists, it is silently overwritten. Required. |
| `-p`\|`--processor` *processor* | Specify the image processor to apply against the image file. Valid options are Ls (default), XML, Delimited, Indented, and FileDistribution. |
| `-skipBlocks` | Do not enumerate individual blocks within files. This may save processing time and output file space on namespaces with very large files. The Ls processor reads the blocks to correctly determine file sizes and ignores this option. |
| `-printToScreen` | Pipe output of processor to console as well as specified file. On extremely large namespaces, this may increase processing time by an order of magnitude. |
| `-delimiter` *arg* | When used in conjunction with the Delimited processor, replaces the default tab delimiter with the string specified by *arg*. |
| `-h`\|`--help` | Display the tool usage and help information and exit. |
@ -0,0 +1,127 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
HDFS Support for Multihomed Networks
====================================
This document is targeted at cluster administrators deploying `HDFS` in multihomed networks. Similar support for `YARN`/`MapReduce` is work in progress and will be documented when available.
* [HDFS Support for Multihomed Networks](#HDFS_Support_for_Multihomed_Networks)
* [Multihoming Background](#Multihoming_Background)
* [Fixing Hadoop Issues In Multihomed Environments](#Fixing_Hadoop_Issues_In_Multihomed_Environments)
* [Ensuring HDFS Daemons Bind All Interfaces](#Ensuring_HDFS_Daemons_Bind_All_Interfaces)
* [Clients use Hostnames when connecting to DataNodes](#Clients_use_Hostnames_when_connecting_to_DataNodes)
* [DataNodes use HostNames when connecting to other DataNodes](#DataNodes_use_HostNames_when_connecting_to_other_DataNodes)
Multihoming Background
----------------------
In multihomed networks the cluster nodes are connected to more than one network interface. There could be multiple reasons for doing so.
1. **Security**: Security requirements may dictate that intra-cluster
traffic be confined to a different network than the network used to
transfer data in and out of the cluster.
2. **Performance**: Intra-cluster traffic may use one or more high bandwidth
interconnects like Fiber Channel, Infiniband or 10GbE.
3. **Failover/Redundancy**: The nodes may have multiple network adapters
connected to a single network to handle network adapter failure.
Note that NIC Bonding (also known as NIC Teaming or Link
Aggregation) is a related but separate topic. The following settings
are usually not applicable to a NIC bonding configuration which handles
multiplexing and failover transparently while presenting a single 'logical
network' to applications.
Fixing Hadoop Issues In Multihomed Environments
-----------------------------------------------
### Ensuring HDFS Daemons Bind All Interfaces
By default `HDFS` endpoints are specified as either hostnames or IP addresses. In either case `HDFS` daemons will bind to a single IP address making the daemons unreachable from other networks.
The solution is to have separate setting for server endpoints to force binding the wildcard IP address `INADDR_ANY` i.e. `0.0.0.0`. Do NOT supply a port number with any of these settings.
<property>
<name>dfs.namenode.rpc-bind-host</name>
<value>0.0.0.0</value>
<description>
The actual address the RPC server will bind to. If this optional address is
set, it overrides only the hostname portion of dfs.namenode.rpc-address.
It can also be specified per name node or name service for HA/Federation.
This is useful for making the name node listen on all interfaces by
setting it to 0.0.0.0.
</description>
</property>
<property>
<name>dfs.namenode.servicerpc-bind-host</name>
<value>0.0.0.0</value>
<description>
The actual address the service RPC server will bind to. If this optional address is
set, it overrides only the hostname portion of dfs.namenode.servicerpc-address.
It can also be specified per name node or name service for HA/Federation.
This is useful for making the name node listen on all interfaces by
setting it to 0.0.0.0.
</description>
</property>
<property>
<name>dfs.namenode.http-bind-host</name>
<value>0.0.0.0</value>
<description>
The actual address the HTTP server will bind to. If this optional address
is set, it overrides only the hostname portion of dfs.namenode.http-address.
It can also be specified per name node or name service for HA/Federation.
This is useful for making the name node HTTP server listen on all
interfaces by setting it to 0.0.0.0.
</description>
</property>
<property>
<name>dfs.namenode.https-bind-host</name>
<value>0.0.0.0</value>
<description>
The actual address the HTTPS server will bind to. If this optional address
is set, it overrides only the hostname portion of dfs.namenode.https-address.
It can also be specified per name node or name service for HA/Federation.
This is useful for making the name node HTTPS server listen on all
interfaces by setting it to 0.0.0.0.
</description>
</property>
### Clients use Hostnames when connecting to DataNodes
By default `HDFS` clients connect to DataNodes using the IP address provided by the NameNode. Depending on the network configuration this IP address may be unreachable by the clients. The fix is letting clients perform their own DNS resolution of the DataNode hostname. The following setting enables this behavior.
<property>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
<description>Whether clients should use datanode hostnames when
connecting to datanodes.
</description>
</property>
### DataNodes use HostNames when connecting to other DataNodes
Rarely, the NameNode-resolved IP address for a DataNode may be unreachable from other DataNodes. The fix is to force DataNodes to perform their own DNS resolution for inter-DataNode connections. The following setting enables this behavior.
<property>
<name>dfs.datanode.use.datanode.hostname</name>
<value>true</value>
<description>Whether datanodes should use datanode hostnames when
connecting to other datanodes for data transfer.
</description>
</property>
@ -0,0 +1,254 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
HDFS NFS Gateway
================
* [HDFS NFS Gateway](#HDFS_NFS_Gateway)
* [Overview](#Overview)
* [Configuration](#Configuration)
* [Start and stop NFS gateway service](#Start_and_stop_NFS_gateway_service)
* [Verify validity of NFS related services](#Verify_validity_of_NFS_related_services)
* [Mount the export "/"](#Mount_the_export_)
* [Allow mounts from unprivileged clients](#Allow_mounts_from_unprivileged_clients)
* [User authentication and mapping](#User_authentication_and_mapping)
Overview
--------
The NFS Gateway supports NFSv3 and allows HDFS to be mounted as part of the client's local file system. Currently NFS Gateway supports and enables the following usage patterns:
* Users can browse the HDFS file system through their local file system
on NFSv3 client compatible operating systems.
* Users can download files from the HDFS file system on to their
local file system.
* Users can upload files from their local file system directly to the
HDFS file system.
* Users can stream data directly to HDFS through the mount point. File
append is supported but random write is not supported.
The NFS gateway machine needs everything required to run an HDFS client, such as the Hadoop JAR files and a HADOOP\_CONF directory. The NFS gateway can be on the same host as a DataNode, the NameNode, or any HDFS client.
Configuration
-------------
The NFS gateway uses a proxy user to proxy all users accessing the NFS mounts. In non-secure mode, the user running the gateway is the proxy user, while in secure mode the user in the Kerberos keytab is the proxy user. Suppose the proxy user is 'nfsserver' and users belonging to the groups 'users-group1' and 'users-group2' use the NFS mounts. Then, in core-site.xml of the NameNode, the following two properties must be set, and only the NameNode needs a restart after the configuration change (NOTE: replace the string 'nfsserver' with the proxy user name in your cluster):
<property>
<name>hadoop.proxyuser.nfsserver.groups</name>
<value>root,users-group1,users-group2</value>
<description>
The 'nfsserver' user is allowed to proxy all members of the 'users-group1' and
'users-group2' groups. Note that in most cases you will need to include the
group "root" because the user "root" (which usually belonges to "root" group) will
generally be the user that initially executes the mount on the NFS client system.
Set this to '*' to allow nfsserver user to proxy any group.
</description>
</property>
<property>
<name>hadoop.proxyuser.nfsserver.hosts</name>
<value>nfs-client-host1.com</value>
<description>
This is the host where the nfs gateway is running. Set this to '*' to allow
requests from any hosts to be proxied.
</description>
</property>
The above are the only required configuration for the NFS gateway in non-secure mode. For Kerberized hadoop clusters, the following configurations need to be added to hdfs-site.xml for the gateway (NOTE: replace string "nfsserver" with the proxy user name and ensure the user contained in the keytab is also the same proxy user):
<property>
<name>nfs.keytab.file</name>
<value>/etc/hadoop/conf/nfsserver.keytab</value> <!-- path to the nfs gateway keytab -->
</property>
<property>
<name>nfs.kerberos.principal</name>
<value>nfsserver/_HOST@YOUR-REALM.COM</value>
</property>
The rest of the NFS gateway configurations are optional for both secure and non-secure mode.
The AIX NFS client has a [few known issues](https://issues.apache.org/jira/browse/HDFS-6549) that prevent it from working correctly by default with the HDFS NFS Gateway. If you want to be able to access the HDFS NFS Gateway from AIX, you should set the following configuration setting to enable work-arounds for these issues:
<property>
<name>nfs.aix.compatibility.mode.enabled</name>
<value>true</value>
</property>
Note that regular, non-AIX clients should NOT enable AIX compatibility mode. The work-arounds implemented by AIX compatibility mode effectively disable safeguards to ensure that listing of directory contents via NFS returns consistent results, and that all data sent to the NFS server can be assured to have been committed.
It is strongly recommended that users update a few configuration properties based on their use cases. All of the following configuration properties can be added or updated in hdfs-site.xml.
* If the client mounts the export with access time update allowed, make sure the following
property is not disabled in the configuration file. Only NameNode needs to restart after
this property is changed. On some Unix systems, the user can disable access time update
by mounting the export with "noatime". If the export is mounted with "noatime", the user
doesn't need to change the following property and thus no need to restart namenode.
<property>
<name>dfs.namenode.accesstime.precision</name>
<value>3600000</value>
<description>The access time for HDFS file is precise upto this value.
The default value is 1 hour. Setting a value of 0 disables
access times for HDFS.
</description>
</property>
* Users are expected to update the file dump directory. The NFS client often
reorders writes, so sequential writes can arrive at the NFS gateway in random
order. This directory is used to temporarily save out-of-order writes
before they are written to HDFS. For each file, the out-of-order writes are dumped after
they accumulate in memory to exceed a certain threshold (e.g., 1 MB).
One needs to make sure the directory has enough
space. For example, if the application uploads 10 files, each of 100 MB,
it is recommended that this directory have roughly 1 GB of space in case a
worst-case write reorder happens to every file. Only the NFS gateway needs to restart after
this property is updated.
<property>
<name>nfs.dump.dir</name>
<value>/tmp/.hdfs-nfs</value>
</property>
* By default, the export can be mounted by any client. To better control the access,
users can update the following property. The value string contains machine name and
access privilege, separated by whitespace
characters. The machine name format can be a single host, a Java regular expression, or an IPv4 address. The access
privilege uses rw or ro to specify read/write or read-only access of the machines to exports. If the access privilege is not provided, the default is read-only. Entries are separated by ";".
For example: "192.168.0.0/22 rw ; host.\*\\.example\\.com ; host1.test.org ro;". Only the NFS gateway needs to restart after
this property is updated.
<property>
<name>nfs.exports.allowed.hosts</name>
<value>* rw</value>
</property>
* JVM and log settings. You can export JVM settings (e.g., heap size and GC log) in
HADOOP\_NFS3\_OPTS. More NFS related settings can be found in hadoop-env.sh.
To get NFS debug trace, you can edit the log4j.property file
to add the following. Note, debug trace, especially for ONCRPC, can be very verbose.
To change logging level:
log4j.logger.org.apache.hadoop.hdfs.nfs=DEBUG
To get more details of ONCRPC requests:
log4j.logger.org.apache.hadoop.oncrpc=DEBUG
Start and stop NFS gateway service
----------------------------------
Three daemons are required to provide NFS service: rpcbind (or portmap), mountd and nfsd. The NFS gateway process has both nfsd and mountd. It shares the HDFS root "/" as the only export. It is recommended to use the portmap included in the NFS gateway package. Even though the NFS gateway works with the portmap/rpcbind provided by most Linux distributions, the included portmap is needed on some Linux systems such as RHEL 6.2 due to an [rpcbind bug](https://bugzilla.redhat.com/show_bug.cgi?id=731542). More detailed discussions can be found in [HDFS-4763](https://issues.apache.org/jira/browse/HDFS-4763).
1. Stop nfsv3 and rpcbind/portmap services provided by the platform (commands can be different on various Unix platforms):
[root]> service nfs stop
[root]> service rpcbind stop
2. Start Hadoop's portmap (needs root privileges):
[root]> $HADOOP_PREFIX/bin/hdfs --daemon start portmap
3. Start mountd and nfsd.
No root privileges are required for this command. In non-secure mode, the NFS gateway
should be started by the proxy user mentioned at the beginning of this user guide.
While in secure mode, any user can start NFS gateway
as long as the user has read access to the Kerberos keytab defined in "nfs.keytab.file".
[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon start nfs3
4. Stop NFS gateway services.
[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon stop nfs3
[root]> $HADOOP_PREFIX/bin/hdfs --daemon stop portmap
Optionally, you can forgo running the Hadoop-provided portmap daemon and instead use the system portmap daemon on all operating systems if you start the NFS Gateway as root. This will allow the HDFS NFS Gateway to work around the aforementioned bug and still register using the system portmap daemon. To do so, just start the NFS gateway daemon as you normally would, but make sure to do so as the "root" user, and also set the "HADOOP\_PRIVILEGED\_NFS\_USER" environment variable to an unprivileged user. In this mode the NFS Gateway will start as root to perform its initial registration with the system portmap, and then will drop privileges back to the user specified by the HADOOP\_PRIVILEGED\_NFS\_USER afterward and for the rest of the duration of the lifetime of the NFS Gateway process. Note that if you choose this route, you should skip steps 1 and 2 above.
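For example, one might add a line like the following to hadoop-env.sh before starting the gateway as root; `nfsserver` here is just the proxy user assumed earlier in this guide:

    export HADOOP_PRIVILEGED_NFS_USER=nfsserver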
Verify validity of NFS related services
---------------------------------------
1. Execute the following command to verify if all the services are up and running:
[root]> rpcinfo -p $nfs_server_ip
You should see output similar to the following:
program vers proto port
100005 1 tcp 4242 mountd
100005 2 udp 4242 mountd
100005 2 tcp 4242 mountd
100000 2 tcp 111 portmapper
100000 2 udp 111 portmapper
100005 3 udp 4242 mountd
100005 1 udp 4242 mountd
100003 3 tcp 2049 nfs
100005 3 tcp 4242 mountd
2. Verify if the HDFS namespace is exported and can be mounted.
[root]> showmount -e $nfs_server_ip
You should see output similar to the following:
Exports list on $nfs_server_ip :
/ (everyone)
Mount the export "/"
--------------------
Currently NFS v3 only uses TCP as the transport protocol. NLM is not supported, so the mount option "nolock" is needed. It is recommended to use a hard mount. This is because, even after the client sends all data to the NFS gateway, it may take the NFS gateway some extra time to transfer data to HDFS when writes were reordered by the NFS client kernel.
If a soft mount has to be used, the user should give it a relatively long timeout (at least no less than the default timeout on the host).
The users can mount the HDFS namespace as shown below:
[root]>mount -t nfs -o vers=3,proto=tcp,nolock,noacl $server:/ $mount_point
Then the users can access HDFS as part of the local file system, except that hard link and random write are not supported yet. To optimize the performance of large file I/O, one can increase the NFS transfer size (rsize and wsize) during mount. By default, the NFS gateway supports 1MB as the maximum transfer size. For a larger data transfer size, one needs to update "nfs.rtmax" and "nfs.wtmax" in hdfs-site.xml.
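For example, a mount that requests a larger transfer size might look like the following sketch; the 1MB values are illustrative and must not exceed what the gateway allows via "nfs.rtmax" and "nfs.wtmax".

```bash
# $server and $mount_point are placeholders, as in the mount example above.
# rsize/wsize are given in bytes; 1048576 = 1MB, the gateway's default maximum.
[root]> mount -t nfs -o vers=3,proto=tcp,nolock,noacl,rsize=1048576,wsize=1048576 $server:/ $mount_point
```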
Allow mounts from unprivileged clients
--------------------------------------
In environments where root access on client machines is not generally available, some measure of security can be obtained by ensuring that only NFS clients originating from privileged ports can connect to the NFS server. This feature is referred to as "port monitoring." This feature is not enabled by default in the HDFS NFS Gateway, but can be optionally enabled by setting the following config in hdfs-site.xml on the NFS Gateway machine:
<property>
<name>nfs.port.monitoring.disabled</name>
<value>false</value>
</property>
User authentication and mapping
-------------------------------
NFS gateway in this release uses AUTH\_UNIX style authentication. When the user on NFS client accesses the mount point, NFS client passes the UID to NFS gateway. NFS gateway does a lookup to find user name from the UID, and then passes the username to the HDFS along with the HDFS requests. For example, if the NFS client has current user as "admin", when the user accesses the mounted directory, NFS gateway will access HDFS as user "admin". To access HDFS as the user "hdfs", one needs to switch the current user to "hdfs" on the client system when accessing the mounted directory.
The system administrator must ensure that the user on NFS client host has the same name and UID as that on the NFS gateway host. This is usually not a problem if the same user management system (e.g., LDAP/NIS) is used to create and deploy users on HDFS nodes and NFS client node. In case the user account is created manually on different hosts, one might need to modify UID (e.g., do "usermod -u 123 myusername") on either NFS client or NFS gateway host in order to make it the same on both sides. More technical details of RPC AUTH\_UNIX can be found in [RPC specification](http://tools.ietf.org/html/rfc1057).
Optionally, the system administrator can configure a custom static mapping file in the event one wishes to access the HDFS NFS Gateway from a system with a completely disparate set of UIDs/GIDs. By default this file is located at "/etc/nfs.map", but a custom location can be configured by setting the "static.id.mapping.file" property to the path of the static mapping file. The format of the static mapping file is similar to what is described in the exports(5) manual page, but roughly it is:
# Mapping for clients accessing the NFS gateway
uid 10 100 # Map the remote UID 10 to the local UID 100
gid 11 101 # Map the remote GID 11 to the local GID 101


@ -0,0 +1,284 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
HDFS Permissions Guide
======================
* [HDFS Permissions Guide](#HDFS_Permissions_Guide)
* [Overview](#Overview)
* [User Identity](#User_Identity)
* [Group Mapping](#Group_Mapping)
* [Understanding the Implementation](#Understanding_the_Implementation)
* [Changes to the File System API](#Changes_to_the_File_System_API)
* [Changes to the Application Shell](#Changes_to_the_Application_Shell)
* [The Super-User](#The_Super-User)
* [The Web Server](#The_Web_Server)
* [ACLs (Access Control Lists)](#ACLs_Access_Control_Lists)
* [ACLs File System API](#ACLs_File_System_API)
* [ACLs Shell Commands](#ACLs_Shell_Commands)
* [Configuration Parameters](#Configuration_Parameters)
Overview
--------
The Hadoop Distributed File System (HDFS) implements a permissions model for files and directories that shares much of the POSIX model. Each file and directory is associated with an owner and a group. The file or directory has separate permissions for the user that is the owner, for other users that are members of the group, and for all other users. For files, the r permission is required to read the file, and the w permission is required to write or append to the file. For directories, the r permission is required to list the contents of the directory, the w permission is required to create or delete files or directories, and the x permission is required to access a child of the directory.
In contrast to the POSIX model, there are no setuid or setgid bits for files as there is no notion of executable files. For directories, there are no setuid or setgid bits as a simplification. The sticky bit can be set on directories, preventing anyone except the superuser, directory owner or file owner from deleting or moving the files within the directory. Setting the sticky bit for a file has no effect. Collectively, the permissions of a file or directory are its mode. In general, Unix customs for representing and displaying modes will be used, including the use of octal numbers in this description. When a file or directory is created, its owner is the user identity of the client process, and its group is the group of the parent directory (the BSD rule).
HDFS also provides optional support for POSIX ACLs (Access Control Lists) to augment file permissions with finer-grained rules for specific named users or named groups. ACLs are discussed in greater detail later in this document.
Each client process that accesses HDFS has a two-part identity composed of the user name and groups list. Whenever HDFS must do a permissions check for a file or directory foo accessed by a client process,
* If the user name matches the owner of foo, then the owner permissions are tested;
* Else if the group of foo matches any member of the groups list, then the group permissions are tested;
* Otherwise the other permissions of foo are tested.
If a permissions check fails, the client operation fails.
User Identity
-------------
As of Hadoop 0.22, Hadoop supports two different modes of operation to determine the user's identity, specified by the hadoop.security.authentication property:
* **simple**
In this mode of operation, the identity of a client process is
determined by the host operating system. On Unix-like systems,
the user name is the equivalent of \`whoami\`.
* **kerberos**
In Kerberized operation, the identity of a client process is
determined by its Kerberos credentials. For example, in a
Kerberized environment, a user may use the kinit utility to
obtain a Kerberos ticket-granting-ticket (TGT) and use klist to
determine their current principal. When mapping a Kerberos
principal to an HDFS username, all components except for the
primary are dropped. For example, a principal
todd/foobar@CORP.COMPANY.COM will act as the simple username todd on HDFS.
Regardless of the mode of operation, the user identity mechanism is extrinsic to HDFS itself. There is no provision within HDFS for creating user identities, establishing groups, or processing user credentials.
Group Mapping
-------------
Once a username has been determined as described above, the list of groups is determined by a group mapping service, configured by the hadoop.security.group.mapping property. The default implementation, org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback, will determine if the Java Native Interface (JNI) is available. If JNI is available, the implementation will use the API within hadoop to resolve a list of groups for a user. If JNI is not available then the shell implementation, org.apache.hadoop.security.ShellBasedUnixGroupsMapping, is used. This implementation shells out with the `bash -c groups` command (for a Linux/Unix environment) or the `net group` command (for a Windows environment) to resolve a list of groups for a user.
An alternate implementation, which connects directly to an LDAP server to resolve the list of groups, is available via org.apache.hadoop.security.LdapGroupsMapping. However, this provider should only be used if the required groups reside exclusively in LDAP, and are not materialized on the Unix servers. More information on configuring the group mapping service is available in the Javadocs.
For HDFS, the mapping of users to groups is performed on the NameNode. Thus, the host system configuration of the NameNode determines the group mappings for the users.
Note that HDFS stores the user and group of a file or directory as strings; there is no conversion from user and group identity numbers as is conventional in Unix.
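As a quick sanity check of the configured group mapping, the `hdfs groups` command prints the groups that the NameNode resolves for a user; the user name below is only an example.

```bash
# Ask the NameNode which groups it resolves for user "alice".
[hdfs]$ hdfs groups alice
```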
Understanding the Implementation
--------------------------------
Each file or directory operation passes the full path name to the name node, and the permissions checks are applied along the path for each operation. The client framework will implicitly associate the user identity with the connection to the name node, reducing the need for changes to the existing client API. It has always been the case that when one operation on a file succeeds, the operation might fail when repeated because the file, or some directory on the path, no longer exists. For instance, when the client first begins reading a file, it makes a first request to the name node to discover the location of the first blocks of the file. A second request made to find additional blocks may fail. On the other hand, deleting a file does not revoke access by a client that already knows the blocks of the file. With the addition of permissions, a client's access to a file may be withdrawn between requests. Again, changing permissions does not revoke the access of a client that already knows the file's blocks.
Changes to the File System API
------------------------------
All methods that use a path parameter will throw `AccessControlException` if permission checking fails.
New methods:
* `public FSDataOutputStream create(Path f, FsPermission permission, boolean overwrite, int bufferSize, short replication, long
blockSize, Progressable progress) throws IOException;`
* `public boolean mkdirs(Path f, FsPermission permission) throws IOException;`
* `public void setPermission(Path p, FsPermission permission) throws IOException;`
* `public void setOwner(Path p, String username, String groupname) throws IOException;`
* `public FileStatus getFileStatus(Path f) throws IOException;` will additionally return the user, group and mode associated with the path.
The mode of a new file or directory is restricted by the umask set as a configuration parameter. When the existing `create(path, …)` method (without the permission parameter) is used, the mode of the new file is `0666 & ^umask`. When the new `create(path, permission, …)` method (with the permission parameter P) is used, the mode of the new file is `P & ^umask & 0666`. When a new directory is created with the existing `mkdirs(path)` method (without the permission parameter), the mode of the new directory is `0777 & ^umask`. When the new `mkdirs(path, permission)` method (with the permission parameter P) is used, the mode of the new directory is `P & ^umask & 0777`.
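A small illustrative session (the paths are arbitrary) shows the effect: with the default umask of 022, a file created without an explicit permission gets mode `0666 & ^022 = 644` and a new directory gets `0777 & ^022 = 755`.

```bash
# Create a file and a directory with the default umask of 022.
[hdfs]$ hdfs dfs -D fs.permissions.umask-mode=022 -touchz /tmp/perm-example-file
[hdfs]$ hdfs dfs -D fs.permissions.umask-mode=022 -mkdir /tmp/perm-example-dir
[hdfs]$ hdfs dfs -ls /tmp
# Expected permission strings: -rw-r--r-- for the file and drwxr-xr-x for the directory.
```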
Changes to the Application Shell
--------------------------------
New operations:
* `chmod [-R] mode file ...`
Only the owner of a file or the super-user is permitted to change the mode of a file.
* `chgrp [-R] group file ...`
The user invoking chgrp must belong to the specified group and be
the owner of the file, or be the super-user.
* `chown [-R] [owner][:[group]] file ...`
The owner of a file may only be altered by a super-user.
* `ls file ...`
* `lsr file ...`
The output is reformatted to display the owner, group and mode.
The Super-User
--------------
The super-user is the user with the same identity as name node process itself. Loosely, if you started the name node, then you are the super-user. The super-user can do anything in that permissions checks never fail for the super-user. There is no persistent notion of who was the super-user; when the name node is started the process identity determines who is the super-user for now. The HDFS super-user does not have to be the super-user of the name node host, nor is it necessary that all clusters have the same super-user. Also, an experimenter running HDFS on a personal workstation, conveniently becomes that installation's super-user without any configuration.
In addition, the administrator may identify a distinguished group using a configuration parameter. If set, members of this group are also super-users.
The Web Server
--------------
By default, the identity of the web server is a configuration parameter. That is, the name node has no notion of the identity of the real user, but the web server behaves as if it has the identity (user and groups) of a user chosen by the administrator. Unless the chosen identity matches the super-user, parts of the name space may be inaccessible to the web server.
ACLs (Access Control Lists)
---------------------------
In addition to the traditional POSIX permissions model, HDFS also supports POSIX ACLs (Access Control Lists). ACLs are useful for implementing permission requirements that differ from the natural organizational hierarchy of users and groups. An ACL provides a way to set different permissions for specific named users or named groups, not only the file's owner and the file's group.
By default, support for ACLs is disabled, and the NameNode disallows creation of ACLs. To enable support for ACLs, set `dfs.namenode.acls.enabled` to true in the NameNode configuration.
An ACL consists of a set of ACL entries. Each ACL entry names a specific user or group and grants or denies read, write and execute permissions for that specific user or group. For example:
user::rw-
user:bruce:rwx #effective:r--
group::r-x #effective:r--
group:sales:rwx #effective:r--
mask::r--
other::r--
ACL entries consist of a type, an optional name and a permission string. For display purposes, ':' is used as the delimiter between each field. In this example ACL, the file owner has read-write access, the file group has read-execute access and others have read access. So far, this is equivalent to setting the file's permission bits to 654.
Additionally, there are 2 extended ACL entries for the named user bruce and the named group sales, both granted full access. The mask is a special ACL entry that filters the permissions granted to all named user entries and named group entries, and also the unnamed group entry. In the example, the mask has only read permissions, and we can see that the effective permissions of several ACL entries have been filtered accordingly.
Every ACL must have a mask. If the user doesn't supply a mask while setting an ACL, then a mask is inserted automatically by calculating the union of permissions on all entries that would be filtered by the mask.
Running `chmod` on a file that has an ACL actually changes the permissions of the mask. Since the mask acts as a filter, this effectively constrains the permissions of all extended ACL entries instead of changing just the group entry and possibly missing other extended ACL entries.
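As an illustration (the file path is hypothetical), the following sequence adds a named-user entry and then runs `chmod`; the `chmod` rewrites the mask rather than the group entry.

```bash
# Grant a named user full access, then tighten the mode with chmod.
[hdfs]$ hdfs dfs -setfacl -m user:bruce:rwx /project/data.txt
[hdfs]$ hdfs dfs -chmod 640 /project/data.txt
[hdfs]$ hdfs dfs -getfacl /project/data.txt
# The mask entry is now r--, so bruce's effective permissions are read-only
# even though his named-user entry still says rwx.
```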
The model also differentiates between an "access ACL", which defines the rules to enforce during permission checks, and a "default ACL", which defines the ACL entries that new child files or sub-directories receive automatically during creation. For example:
user::rwx
group::r-x
other::r-x
default:user::rwx
default:user:bruce:rwx #effective:r-x
default:group::r-x
default:group:sales:rwx #effective:r-x
default:mask::r-x
default:other::r-x
Only directories may have a default ACL. When a new file or sub-directory is created, it automatically copies the default ACL of its parent into its own access ACL. A new sub-directory also copies it to its own default ACL. In this way, the default ACL will be copied down through arbitrarily deep levels of the file system tree as new sub-directories get created.
The exact permission values in the new child's access ACL are subject to filtering by the mode parameter. Considering the default umask of 022, this is typically 755 for new directories and 644 for new files. The mode parameter filters the copied permission values for the unnamed user (file owner), the mask and other. Using this particular example ACL, and creating a new sub-directory with 755 for the mode, this mode filtering has no effect on the final result. However, if we consider creation of a file with 644 for the mode, then mode filtering causes the new file's ACL to receive read-write for the unnamed user (file owner), read for the mask and read for others. This mask also means that effective permissions for named user bruce and named group sales are only read.
Note that the copy occurs at time of creation of the new file or sub-directory. Subsequent changes to the parent's default ACL do not change existing children.
The default ACL must have all minimum required ACL entries, including the unnamed user (file owner), unnamed group (file group) and other entries. If the user doesn't supply one of these entries while setting a default ACL, then the entries are inserted automatically by copying the corresponding permissions from the access ACL, or permission bits if there is no access ACL. The default ACL must also have a mask. As described above, if the mask is unspecified, then a mask is inserted automatically by calculating the union of permissions on all entries that would be filtered by the mask.
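A short sketch of default-ACL inheritance follows; the directory names are hypothetical.

```bash
# Add a default entry on the parent; children created afterwards copy it.
[hdfs]$ hdfs dfs -setfacl -m default:user:bruce:rwx /project
[hdfs]$ hdfs dfs -mkdir /project/new-subdir
[hdfs]$ hdfs dfs -getfacl /project/new-subdir
# new-subdir shows user:bruce:rwx in its access ACL and in its own default ACL.
```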
When considering a file that has an ACL, the algorithm for permission checks changes to:
* If the user name matches the owner of file, then the owner
permissions are tested;
* Else if the user name matches the name in one of the named user entries,
then these permissions are tested, filtered by the mask permissions;
* Else if the group of file matches any member of the groups list,
and if these permissions filtered by the mask grant access, then these
permissions are used;
* Else if there is a named group entry matching a member of the groups list,
and if these permissions filtered by the mask grant access, then these
permissions are used;
* Else if the file group or any named group entry matches a member of the
groups list, but access was not granted by any of those permissions, then
access is denied;
* Otherwise the other permissions of file are tested.
Best practice is to rely on traditional permission bits to implement most permission requirements, and define a smaller number of ACLs to augment the permission bits with a few exceptional rules. A file with an ACL incurs an additional cost in memory in the NameNode compared to a file that has only permission bits.
ACLs File System API
--------------------
New methods:
* `public void modifyAclEntries(Path path, List<AclEntry> aclSpec) throws IOException;`
* `public void removeAclEntries(Path path, List<AclEntry> aclSpec) throws IOException;`
* `public void removeDefaultAcl(Path path) throws IOException;`
* `public void removeAcl(Path path) throws IOException;`
* `public void setAcl(Path path, List<AclEntry> aclSpec) throws IOException;`
* `public AclStatus getAclStatus(Path path) throws IOException;`
ACLs Shell Commands
-------------------
* `hdfs dfs -getfacl [-R] <path>`
Displays the Access Control Lists (ACLs) of files and directories. If a
directory has a default ACL, then getfacl also displays the default ACL.
* `hdfs dfs -setfacl [-R] [-b|-k -m|-x <acl_spec> <path>]|[--set <acl_spec> <path>]`
Sets Access Control Lists (ACLs) of files and directories.
* `hdfs dfs -ls <args>`
The output of `ls` will append a '+' character to the permissions
string of any file or directory that has an ACL.
See the [File System Shell](../hadoop-common/FileSystemShell.html)
documentation for full coverage of these commands.
Configuration Parameters
------------------------
* `dfs.permissions.enabled = true`
If yes, use the permissions system as described here. If no,
permission checking is turned off, but all other behavior is
unchanged. Switching from one parameter value to the other does not
change the mode, owner or group of files or directories.
Regardless of whether permissions are on or off, chmod, chgrp, chown and
setfacl always check permissions. These functions are only useful in
the permissions context, and so there is no backwards compatibility
issue. Furthermore, this allows administrators to reliably set owners and permissions
in advance of turning on regular permissions checking.
* `dfs.web.ugi = webuser,webgroup`
The user name to be used by the web server. Setting this to the
name of the super-user allows any web client to see everything.
Changing this to an otherwise unused identity allows web clients to
see only those things visible using "other" permissions. Additional
groups may be added to the comma-separated list.
* `dfs.permissions.superusergroup = supergroup`
The name of the group of super-users.
* `fs.permissions.umask-mode = 0022`
The umask used when creating files and directories. In
configuration files, the decimal value 18 (equivalent to octal 022) may also be used.
* `dfs.cluster.administrators = ACL-for-admins`
The administrators for the cluster specified as an ACL. This
controls who can access the default servlets, etc. in the HDFS.
* `dfs.namenode.acls.enabled = true`
Set to true to enable support for HDFS ACLs (Access Control Lists). By
default, ACLs are disabled. When ACLs are disabled, the NameNode rejects
all attempts to set an ACL.


@ -0,0 +1,92 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
HDFS Quotas Guide
=================
* [HDFS Quotas Guide](#HDFS_Quotas_Guide)
* [Overview](#Overview)
* [Name Quotas](#Name_Quotas)
* [Space Quotas](#Space_Quotas)
* [Administrative Commands](#Administrative_Commands)
* [Reporting Command](#Reporting_Command)
Overview
--------
The Hadoop Distributed File System (HDFS) allows the administrator to set quotas for the number of names used and the amount of space used for individual directories. Name quotas and space quotas operate independently, but the administration and implementation of the two types of quotas are closely parallel.
Name Quotas
-----------
The name quota is a hard limit on the number of file and directory names in the tree rooted at that directory. File and directory creations fail if the quota would be exceeded. Quotas stick with renamed directories; the rename operation fails if the operation would result in a quota violation. The attempt to set a quota will still succeed even if the directory would be in violation of the new quota. A newly created directory has no associated quota. The largest quota is `Long.Max_Value`. A quota of one forces a directory to remain empty. (Yes, a directory counts against its own quota!)
Quotas are persistent with the fsimage. When starting, if the fsimage is immediately in violation of a quota (perhaps the fsimage was surreptitiously modified), a warning is printed for each of such violations. Setting or removing a quota creates a journal entry.
Space Quotas
------------
The space quota is a hard limit on the number of bytes used by files in the tree rooted at that directory. Block allocations fail if the quota would not allow a full block to be written. Each replica of a block counts against the quota. Quotas stick with renamed directories; the rename operation fails if the operation would result in a quota violation. A newly created directory has no associated quota. The largest quota is `Long.Max_Value`. A quota of zero still permits files to be created, but no blocks can be added to the files. Directories don't use host file system space and don't count against the space quota. The host file system space used to save the file meta data is not counted against the quota. Quotas are charged at the intended replication factor for the file; changing the replication factor for a file will credit or debit quotas.
Quotas are persistent with the fsimage. When starting, if the fsimage is immediately in violation of a quota (perhaps the fsimage was surreptitiously modified), a warning is printed for each of such violations. Setting or removing a quota creates a journal entry.
Administrative Commands
-----------------------
Quotas are managed by a set of commands available only to the administrator.
* `hdfs dfsadmin -setQuota <N> <directory>...<directory>`
Set the name quota to be N for each directory. Best effort for each
directory, with faults reported if N is not a positive long
integer, the directory does not exist or it is a file, or the
directory would immediately exceed the new quota.
* `hdfs dfsadmin -clrQuota <directory>...<directory>`
Remove any name quota for each directory. Best effort for each
directory, with faults reported if the directory does not exist or
it is a file. It is not a fault if the directory has no quota.
* `hdfs dfsadmin -setSpaceQuota <N> <directory>...<directory>`
Set the space quota to be N bytes for each directory. This is a
hard limit on total size of all the files under the directory tree.
The space quota takes replication also into account, i.e. one GB of
data with replication of 3 consumes 3GB of quota. N can also be
specified with a binary prefix for convenience, e.g. 50g for 50
gigabytes and 2t for 2 terabytes. Best effort for each
directory, with faults reported if N is neither zero nor a positive
integer, the directory does not exist or it is a file, or the
directory would immediately exceed the new quota.
* `hdfs dfsadmin -clrSpaceQuota <directory>...<directory>`
Remove any space quota for each directory. Best effort for each
directory, with faults reported if the directory does not exist or
it is a file. It is not a fault if the directory has no quota.
Reporting Command
-----------------
An extension to the count command of the HDFS shell reports quota values and the current count of names and bytes in use.
* `hadoop fs -count -q <directory>...<directory>`
With the -q option, also report the name quota value set for each
directory, the available name quota remaining, the space quota
value set, and the available space quota remaining. If the
directory does not have a quota set, the reported values are `none`
and `inf`.
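For example, a hypothetical administrative session combining these commands (the directory and limits are arbitrary) might look like:

```bash
# Limit /user/alice to 10,000 names and 1 terabyte of raw space, then inspect usage.
[hdfs]$ hdfs dfsadmin -setQuota 10000 /user/alice
[hdfs]$ hdfs dfsadmin -setSpaceQuota 1t /user/alice
[hdfs]$ hadoop fs -count -q /user/alice
# Remove the space quota again when it is no longer needed.
[hdfs]$ hdfs dfsadmin -clrSpaceQuota /user/alice
```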


@ -0,0 +1,375 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
HDFS Users Guide
================
* [HDFS Users Guide](#HDFS_Users_Guide)
* [Purpose](#Purpose)
* [Overview](#Overview)
* [Prerequisites](#Prerequisites)
* [Web Interface](#Web_Interface)
* [Shell Commands](#Shell_Commands)
* [DFSAdmin Command](#DFSAdmin_Command)
* [Secondary NameNode](#Secondary_NameNode)
* [Checkpoint Node](#Checkpoint_Node)
* [Backup Node](#Backup_Node)
* [Import Checkpoint](#Import_Checkpoint)
* [Balancer](#Balancer)
* [Rack Awareness](#Rack_Awareness)
* [Safemode](#Safemode)
* [fsck](#fsck)
* [fetchdt](#fetchdt)
* [Recovery Mode](#Recovery_Mode)
* [Upgrade and Rollback](#Upgrade_and_Rollback)
* [DataNode Hot Swap Drive](#DataNode_Hot_Swap_Drive)
* [File Permissions and Security](#File_Permissions_and_Security)
* [Scalability](#Scalability)
* [Related Documentation](#Related_Documentation)
Purpose
-------
This document is a starting point for users working with Hadoop Distributed File System (HDFS) either as a part of a Hadoop cluster or as a stand-alone general purpose distributed file system. While HDFS is designed to "just work" in many environments, a working knowledge of HDFS helps greatly with configuration improvements and diagnostics on a specific cluster.
Overview
--------
HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data. The HDFS Architecture Guide describes HDFS in detail. This user guide primarily deals with the interaction of users and administrators with HDFS clusters. The HDFS architecture diagram depicts basic interactions among the NameNode, the DataNodes, and the clients. Clients contact the NameNode for file metadata or file modifications and perform actual file I/O directly with the DataNodes.
The following are some of the salient features that could be of interest to many users.
* Hadoop, including HDFS, is well suited for distributed storage and
distributed processing using commodity hardware. It is fault
tolerant, scalable, and extremely simple to expand. MapReduce, well
known for its simplicity and applicability for large set of
distributed applications, is an integral part of Hadoop.
* HDFS is highly configurable with a default configuration well
suited for many installations. Most of the time, configuration
needs to be tuned only for very large clusters.
* Hadoop is written in Java and is supported on all major platforms.
* Hadoop supports shell-like commands to interact with HDFS directly.
* The NameNode and DataNodes have built-in web servers that make it
easy to check the current status of the cluster.
* New features and improvements are regularly implemented in HDFS.
The following is a subset of useful features in HDFS:
* File permissions and authentication.
* Rack awareness: to take a node's physical location into
account while scheduling tasks and allocating storage.
* Safemode: an administrative mode for maintenance.
* `fsck`: a utility to diagnose health of the file system, to find
missing files or blocks.
* `fetchdt`: a utility to fetch DelegationToken and store it in a
file on the local system.
* Balancer: tool to balance the cluster when the data is
unevenly distributed among DataNodes.
* Upgrade and rollback: after a software upgrade, it is possible
to rollback to HDFS' state before the upgrade in case of unexpected problems.
* Secondary NameNode: performs periodic checkpoints of the
namespace and helps keep the size of file containing log of
HDFS modifications within certain limits at the NameNode.
* Checkpoint node: performs periodic checkpoints of the
namespace and helps minimize the size of the log stored at the
NameNode containing changes to the HDFS. Replaces the role
previously filled by the Secondary NameNode, though is not yet
battle hardened. The NameNode allows multiple Checkpoint nodes
simultaneously, as long as there are no Backup nodes
registered with the system.
* Backup node: An extension to the Checkpoint node. In addition
to checkpointing it also receives a stream of edits from the
NameNode and maintains its own in-memory copy of the
namespace, which is always in sync with the active NameNode
namespace state. Only one Backup node may be registered with
the NameNode at once.
Prerequisites
-------------
The following documents describe how to install and set up a Hadoop cluster:
* [Single Node Setup](../hadoop-common/SingleCluster.html) for first-time users.
* [Cluster Setup](../hadoop-common/ClusterSetup.html) for large, distributed clusters.
The rest of this document assumes the user is able to set up and run a HDFS with at least one DataNode. For the purpose of this document, both the NameNode and DataNode could be running on the same physical machine.
Web Interface
-------------
NameNode and DataNode each run an internal web server in order to display basic information about the current status of the cluster. With the default configuration, the NameNode front page is at `http://namenode-name:50070/`. It lists the DataNodes in the cluster and basic statistics of the cluster. The web interface can also be used to browse the file system (using "Browse the file system" link on the NameNode front page).
Shell Commands
--------------
Hadoop includes various shell-like commands that directly interact with HDFS and other file systems that Hadoop supports. The command `bin/hdfs dfs -help` lists the commands supported by the Hadoop shell. Furthermore, the command `bin/hdfs dfs -help command-name` displays more detailed help for a command. These commands support most of the normal file system operations like copying files, changing file permissions, etc. It also supports a few HDFS-specific operations like changing the replication of files. For more information see [File System Shell Guide](../hadoop-common/FileSystemShell.html).
### DFSAdmin Command
The `bin/hdfs dfsadmin` command supports a few HDFS administration related operations. The `bin/hdfs dfsadmin -help` command lists all the commands currently supported. For example:
* `-report`: reports basic statistics of HDFS. Some of this
information is also available on the NameNode front page.
* `-safemode`: though usually not required, an administrator can
manually enter or leave Safemode.
* `-finalizeUpgrade`: removes previous backup of the cluster made
during last upgrade.
* `-refreshNodes`: Updates the namenode with the set of datanodes
allowed to connect to the namenode. Namenodes re-read datanode
hostnames in the files defined by `dfs.hosts` and `dfs.hosts.exclude`.
Hosts defined in `dfs.hosts` are the datanodes that are part of the
cluster. If there are entries in `dfs.hosts`, only the hosts in it
are allowed to register with the namenode. Entries in
`dfs.hosts.exclude` are datanodes that need to be decommissioned.
Datanodes complete decommissioning when all the replicas from them
are replicated to other datanodes. Decommissioned nodes are not
automatically shut down and are not chosen for writing new
replicas.
* `-printTopology` : Print the topology of the cluster. Display a tree
of racks and datanodes attached to the racks as viewed by the
NameNode.
For command usage, see [dfsadmin](./HDFSCommands.html#dfsadmin).
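As an illustration, a few of the operations above can be exercised as follows (run as the HDFS administrator):

```bash
# Basic cluster statistics, the rack/datanode tree, and a re-read of the dfs.hosts files.
[hdfs]$ hdfs dfsadmin -report
[hdfs]$ hdfs dfsadmin -printTopology
[hdfs]$ hdfs dfsadmin -refreshNodes
```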
Secondary NameNode
------------------
The NameNode stores modifications to the file system as a log appended to a native file system file, edits. When a NameNode starts up, it reads HDFS state from an image file, fsimage, and then applies edits from the edits log file. It then writes new HDFS state to the fsimage and starts normal operation with an empty edits file. Since NameNode merges fsimage and edits files only during start up, the edits log file could get very large over time on a busy cluster. Another side effect of a larger edits file is that next restart of NameNode takes longer.
The secondary NameNode merges the fsimage and the edits log files periodically and keeps edits log size within a limit. It is usually run on a different machine than the primary NameNode since its memory requirements are on the same order as the primary NameNode.
The start of the checkpoint process on the secondary NameNode is controlled by two configuration parameters.
* `dfs.namenode.checkpoint.period`, set to 1 hour by default, specifies
the maximum delay between two consecutive checkpoints, and
* `dfs.namenode.checkpoint.txns`, set to 1 million by default, defines the
number of uncheckpointed transactions on the NameNode which will
force an urgent checkpoint, even if the checkpoint period has not
been reached.
The secondary NameNode stores the latest checkpoint in a directory which is structured the same way as the primary NameNode's directory, so that the checkpointed image is always ready to be read by the primary NameNode if necessary.
For command usage, see [secondarynamenode](./HDFSCommands.html#secondarynamenode).
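To confirm which checkpoint settings are in effect on a given node, one option is the `hdfs getconf` utility:

```bash
# Print the effective checkpoint period and transaction threshold.
[hdfs]$ hdfs getconf -confKey dfs.namenode.checkpoint.period
[hdfs]$ hdfs getconf -confKey dfs.namenode.checkpoint.txns
```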
Checkpoint Node
---------------
NameNode persists its namespace using two files: fsimage, which is the latest checkpoint of the namespace and edits, a journal (log) of changes to the namespace since the checkpoint. When a NameNode starts up, it merges the fsimage and edits journal to provide an up-to-date view of the file system metadata. The NameNode then overwrites fsimage with the new HDFS state and begins a new edits journal.
The Checkpoint node periodically creates checkpoints of the namespace. It downloads fsimage and edits from the active NameNode, merges them locally, and uploads the new image back to the active NameNode. The Checkpoint node usually runs on a different machine than the NameNode since its memory requirements are on the same order as the NameNode. The Checkpoint node is started by `bin/hdfs namenode -checkpoint` on the node specified in the configuration file.
The location of the Checkpoint (or Backup) node and its accompanying web interface are configured via the `dfs.namenode.backup.address` and `dfs.namenode.backup.http-address` configuration variables.
The start of the checkpoint process on the Checkpoint node is controlled by two configuration parameters.
* `dfs.namenode.checkpoint.period`, set to 1 hour by default, specifies
the maximum delay between two consecutive checkpoints
* `dfs.namenode.checkpoint.txns`, set to 1 million by default, defines the
number of uncheckpointed transactions on the NameNode which will
force an urgent checkpoint, even if the checkpoint period has not
been reached.
The Checkpoint node stores the latest checkpoint in a directory that is structured the same as the NameNode's directory. This allows the checkpointed image to be always available for reading by the NameNode if necessary. See Import checkpoint.
Multiple checkpoint nodes may be specified in the cluster configuration file.
For command usage, see [namenode](./HDFSCommands.html#namenode).
Backup Node
-----------
The Backup node provides the same checkpointing functionality as the Checkpoint node, as well as maintaining an in-memory, up-to-date copy of the file system namespace that is always synchronized with the active NameNode state. Along with accepting a journal stream of file system edits from the NameNode and persisting this to disk, the Backup node also applies those edits into its own copy of the namespace in memory, thus creating a backup of the namespace.
The Backup node does not need to download fsimage and edits files from the active NameNode in order to create a checkpoint, as would be required with a Checkpoint node or Secondary NameNode, since it already has an up-to-date view of the namespace state in memory. The Backup node checkpoint process is more efficient as it only needs to save the namespace into the local fsimage file and reset edits.
As the Backup node maintains a copy of the namespace in memory, its RAM requirements are the same as the NameNode.
The NameNode supports one Backup node at a time. No Checkpoint nodes may be registered if a Backup node is in use. Using multiple Backup nodes concurrently will be supported in the future.
The Backup node is configured in the same manner as the Checkpoint node. It is started with `bin/hdfs namenode -backup`.
The location of the Backup (or Checkpoint) node and its accompanying web interface are configured via the `dfs.namenode.backup.address` and `dfs.namenode.backup.http-address` configuration variables.
Use of a Backup node provides the option of running the NameNode with no persistent storage, delegating all responsibility for persisting the state of the namespace to the Backup node. To do this, start the NameNode with the `-importCheckpoint` option, along with specifying no persistent edits storage directories (`dfs.namenode.edits.dir`) in the NameNode configuration.
For a complete discussion of the motivation behind the creation of the Backup node and Checkpoint node, see [HADOOP-4539](https://issues.apache.org/jira/browse/HADOOP-4539). For command usage, see [namenode](./HDFSCommands.html#namenode).
Import Checkpoint
-----------------
The latest checkpoint can be imported to the NameNode if all other copies of the image and the edits files are lost. In order to do that one should:
* Create an empty directory specified in the `dfs.namenode.name.dir`
configuration variable;
* Specify the location of the checkpoint directory in the
configuration variable `dfs.namenode.checkpoint.dir`;
* and start the NameNode with `-importCheckpoint` option.
The NameNode will upload the checkpoint from the `dfs.namenode.checkpoint.dir` directory and then save it to the NameNode directory(s) set in `dfs.namenode.name.dir`. The NameNode will fail if a legal image is contained in `dfs.namenode.name.dir`. The NameNode verifies that the image in `dfs.namenode.checkpoint.dir` is consistent, but does not modify it in any way.
For command usage, see [namenode](./HDFSCommands.html#namenode).
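A minimal sketch of this procedure, assuming hypothetical local paths that match `dfs.namenode.name.dir` and `dfs.namenode.checkpoint.dir` in hdfs-site.xml:

```bash
# Create an empty directory for dfs.namenode.name.dir, then import the checkpoint.
[hdfs]$ mkdir -p /data/dfs/name
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -importCheckpoint
```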
Balancer
--------
HDFS data might not always be placed uniformly across the DataNodes. One common reason is the addition of new DataNodes to an existing cluster. While placing new blocks (data for a file is stored as a series of blocks), the NameNode considers various parameters before choosing the DataNodes to receive these blocks. Some of the considerations are:
* Policy to keep one of the replicas of a block on the same node as
the node that is writing the block.
* Need to spread different replicas of a block across the racks so
that cluster can survive loss of whole rack.
* One of the replicas is usually placed on the same rack as the node
writing to the file so that cross-rack network I/O is reduced.
* Spread HDFS data uniformly across the DataNodes in the cluster.
Due to multiple competing considerations, data might not be uniformly placed across the DataNodes. HDFS provides a tool for administrators that analyzes block placement and rebalances data across the DataNodes. A brief administrator's guide for the balancer is available at [HADOOP-1652](https://issues.apache.org/jira/browse/HADOOP-1652).
For command usage, see [balancer](./HDFSCommands.html#balancer).
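An illustrative run is shown below; the threshold is the allowed deviation (in percent) of each DataNode's utilization from the cluster average.

```bash
# Rebalance until every DataNode is within 10% of the average utilization.
[hdfs]$ $HADOOP_PREFIX/bin/hdfs balancer -threshold 10
```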
Rack Awareness
--------------
Typically large Hadoop clusters are arranged in racks, and network traffic between different nodes within the same rack is much more desirable than network traffic across the racks. In addition, the NameNode tries to place replicas of a block on multiple racks for improved fault tolerance. Hadoop lets the cluster administrators decide which rack a node belongs to through the configuration variable `net.topology.script.file.name`. When this script is configured, each node runs the script to determine its rack id. A default installation assumes all the nodes belong to the same rack. This feature and configuration is further described in the PDF attached to [HADOOP-692](https://issues.apache.org/jira/browse/HADOOP-692).
Safemode
--------
During start up the NameNode loads the file system state from the fsimage and the edits log file. It then waits for DataNodes to report their blocks so that it does not prematurely start replicating the blocks even though enough replicas may already exist in the cluster. During this time the NameNode stays in Safemode. Safemode for the NameNode is essentially a read-only mode for the HDFS cluster, where it does not allow any modifications to the file system or blocks. Normally the NameNode leaves Safemode automatically after the DataNodes have reported that most file system blocks are available. If required, HDFS can be placed in Safemode explicitly using the `bin/hdfs dfsadmin -safemode` command. The NameNode front page shows whether Safemode is on or off. A more detailed description and configuration is maintained as JavaDoc for `setSafeMode()`.
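For example, an administrator can check and control Safemode as follows:

```bash
# Query, enter, and leave Safemode manually.
[hdfs]$ hdfs dfsadmin -safemode get
[hdfs]$ hdfs dfsadmin -safemode enter
[hdfs]$ hdfs dfsadmin -safemode leave
```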
fsck
----
HDFS supports the fsck command to check for various inconsistencies. It is designed for reporting problems with various files, for example, missing blocks for a file or under-replicated blocks. Unlike a traditional fsck utility for native file systems, this command does not correct the errors it detects. Normally the NameNode automatically corrects most of the recoverable failures. By default fsck ignores open files but provides an option to select all files during reporting. The HDFS fsck command is not a Hadoop shell command. It can be run as `bin/hdfs fsck`. For command usage, see [fsck](./HDFSCommands.html#fsck). fsck can be run on the whole file system or on a subset of files.
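Two illustrative invocations (the path is an example):

```bash
# Check the whole namespace, then a subtree with per-file block and location details.
[hdfs]$ hdfs fsck /
[hdfs]$ hdfs fsck /user/alice -files -blocks -locations
```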
fetchdt
-------
HDFS supports the fetchdt command to fetch a Delegation Token and store it in a file on the local system. This token can later be used to access a secure server (the NameNode, for example) from a non-secure client. The utility uses either RPC or HTTPS (over Kerberos) to get the token, and thus requires Kerberos tickets to be present before the run (run kinit to get the tickets). The HDFS fetchdt command is not a Hadoop shell command. It can be run as `bin/hdfs fetchdt DTfile`. After you get the token, you can run an HDFS command without having Kerberos tickets by pointing the `HADOOP_TOKEN_FILE_LOCATION` environment variable to the delegation token file. For command usage, see the [fetchdt](./HDFSCommands.html#fetchdt) command.
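A hypothetical flow (the principal and file names are examples):

```bash
# Obtain a Kerberos ticket, fetch a delegation token, then reuse it without a ticket.
[hdfs]$ kinit alice@EXAMPLE.COM
[hdfs]$ hdfs fetchdt /tmp/alice.dt
[hdfs]$ HADOOP_TOKEN_FILE_LOCATION=/tmp/alice.dt hdfs dfs -ls /
```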
Recovery Mode
-------------
Typically, you will configure multiple metadata storage locations. Then, if one storage location is corrupt, you can read the metadata from one of the other storage locations.
However, what can you do if the only storage locations available are corrupt? In this case, there is a special NameNode startup mode called Recovery mode that may allow you to recover most of your data.
You can start the NameNode in recovery mode like so: `namenode -recover`
When in recovery mode, the NameNode will interactively prompt you at the command line about possible courses of action you can take to recover your data.
If you don't want to be prompted, you can give the `-force` option. This option will force recovery mode to always select the first choice. Normally, this will be the most reasonable choice.
Because Recovery mode can cause you to lose data, you should always back up your edit log and fsimage before using it.
Upgrade and Rollback
--------------------
When Hadoop is upgraded on an existing cluster, as with any software upgrade, it is possible there are new bugs or incompatible changes that affect existing applications and were not discovered earlier. In any non-trivial HDFS installation, it is not an option to lose any data, let alone to restart HDFS from scratch. HDFS allows administrators to go back to an earlier version of Hadoop and roll back the cluster to the state it was in before the upgrade. HDFS upgrade is described in more detail in the [Hadoop Upgrade](http://wiki.apache.org/hadoop/Hadoop_Upgrade) Wiki page. HDFS can have one such backup at a time. Before upgrading, administrators need to remove the existing backup using the `bin/hadoop dfsadmin -finalizeUpgrade` command. The following briefly describes the typical upgrade procedure:
* Before upgrading Hadoop software, finalize if there is an existing
backup. `dfsadmin -upgradeProgress` status can tell if the cluster
needs to be finalized.
* Stop the cluster and distribute new version of Hadoop.
* Run the new version with `-upgrade` option (`bin/start-dfs.sh -upgrade`).
* Most of the time, cluster works just fine. Once the new HDFS is
considered working well (may be after a few days of operation),
finalize the upgrade. Note that until the cluster is finalized,
deleting the files that existed before the upgrade does not free up
real disk space on the DataNodes.
* If there is a need to move back to the old version,
* stop the cluster and distribute earlier version of Hadoop.
* start the cluster with rollback option. (`bin/start-dfs.sh -rollback`).
When upgrading to a new version of HDFS, it is necessary to rename or delete any paths that are reserved in the new version of HDFS. If the NameNode encounters a reserved path during upgrade, it will print an error like the following:
` /.reserved is a reserved path and .snapshot is a reserved path component in this version of HDFS. Please rollback and delete or rename this path, or upgrade with the -renameReserved [key-value pairs] option to automatically rename these paths during upgrade.`
Specifying `-upgrade -renameReserved [optional key-value pairs]` causes the NameNode to automatically rename any reserved paths found during startup. For example, to rename all paths named `.snapshot` to `.my-snapshot` and `.reserved` to `.my-reserved`, a user would specify `-upgrade -renameReserved .snapshot=.my-snapshot,.reserved=.my-reserved`.
If no key-value pairs are specified with `-renameReserved`, the NameNode will then suffix reserved paths with `.<LAYOUT-VERSION>.UPGRADE_RENAMED`, e.g. `.snapshot.-51.UPGRADE_RENAMED`.
There are some caveats to this renaming process. It's recommended, if possible, to first `hdfs dfsadmin -saveNamespace` before upgrading. This is because data inconsistency can result if an edit log operation refers to the destination of an automatically renamed file.
DataNode Hot Swap Drive
-----------------------
Datanode supports hot swappable drives. The user can add or replace HDFS data volumes without shutting down the DataNode. The following briefly describes the typical hot swapping drive procedure:
* If there are new storage directories, the user should format them and mount them
appropriately.
* The user updates the DataNode configuration `dfs.datanode.data.dir`
to reflect the data volume directories that will be actively in use.
* The user runs `dfsadmin -reconfig datanode HOST:PORT start` to start
the reconfiguration process. The user can use
`dfsadmin -reconfig datanode HOST:PORT status`
to query the running status of the reconfiguration task.
* Once the reconfiguration task has completed, the user can safely `umount`
the removed data volume directories and physically remove the disks.
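For example, after editing `dfs.datanode.data.dir`, the reconfiguration might be driven as follows; the host name and IPC port are placeholders for a DataNode in your cluster.

```bash
# Start the reconfiguration on the DataNode, then poll until it completes.
[hdfs]$ hdfs dfsadmin -reconfig datanode dn1.example.com:50020 start
[hdfs]$ hdfs dfsadmin -reconfig datanode dn1.example.com:50020 status
```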
File Permissions and Security
-----------------------------
The file permissions are designed to be similar to file permissions on other familiar platforms like Linux. Currently, security is limited to simple file permissions. The user that starts NameNode is treated as the superuser for HDFS. Future versions of HDFS will support network authentication protocols like Kerberos for user authentication and encryption of data transfers. The details are discussed in the Permissions Guide.
Scalability
-----------
Hadoop currently runs on clusters with thousands of nodes. The [PoweredBy](http://wiki.apache.org/hadoop/PoweredBy) Wiki page lists some of the organizations that deploy Hadoop on large clusters. HDFS has one NameNode for each cluster. Currently the total memory available on the NameNode is the primary scalability limitation. On very large clusters, increasing the average size of files stored in HDFS helps with increasing cluster size without increasing memory requirements on the NameNode. The default configuration may not suit very large clusters. The [FAQ](http://wiki.apache.org/hadoop/FAQ) Wiki page lists suggested configuration improvements for large Hadoop clusters.
Related Documentation
---------------------
This user guide is a good starting point for working with HDFS. While the user guide continues to improve, there is a large wealth of documentation about Hadoop and HDFS. The following list is a starting point for further exploration:
* [Hadoop Site](http://hadoop.apache.org): The home page for the Apache Hadoop site.
* [Hadoop Wiki](http://wiki.apache.org/hadoop/FrontPage): The home page (FrontPage) for the Hadoop Wiki. Unlike the released documentation, which is part of Hadoop source tree, Hadoop Wiki is regularly edited by Hadoop Community.
* [FAQ](http://wiki.apache.org/hadoop/FAQ): The FAQ Wiki page.
* [Hadoop JavaDoc API](../../api/index.html).
* Hadoop User Mailing List: user[at]hadoop.apache.org.
* Explore [hdfs-default.xml](./hdfs-default.xml). It includes brief description of most of the configuration variables available.
* [HDFS Commands Guide](./HDFSCommands.html): HDFS commands usage.


@ -0,0 +1,92 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
C API libhdfs
=============
* [C API libhdfs](#C_API_libhdfs)
* [Overview](#Overview)
* [The APIs](#The_APIs)
* [A Sample Program](#A_Sample_Program)
* [How To Link With The Library](#How_To_Link_With_The_Library)
* [Common Problems](#Common_Problems)
* [Thread Safe](#Thread_Safe)
Overview
--------
libhdfs is a JNI based C API for Hadoop's Distributed File System (HDFS). It provides C APIs to a subset of the HDFS APIs to manipulate HDFS files and the filesystem. libhdfs is part of the Hadoop distribution and comes pre-compiled in `$HADOOP_HDFS_HOME/lib/native/libhdfs.so` . libhdfs is compatible with Windows and can be built on Windows by running `mvn compile` within the `hadoop-hdfs-project/hadoop-hdfs` directory of the source tree.
The APIs
--------
The libhdfs APIs are a subset of the [Hadoop FileSystem APIs](../../api/org/apache/hadoop/fs/FileSystem.html).
The header file for libhdfs describes each API in detail and is available in `$HADOOP_HDFS_HOME/include/hdfs.h`.
A Sample Program
----------------
```c
#include "hdfs.h"
int main(int argc, char **argv) {
hdfsFS fs = hdfsConnect("default", 0);
const char* writePath = "/tmp/testfile.txt";
hdfsFile writeFile = hdfsOpenFile(fs, writePath, O_WRONLY |O_CREAT, 0, 0, 0);
if(!writeFile) {
fprintf(stderr, "Failed to open %s for writing!\n", writePath);
exit(-1);
}
char* buffer = "Hello, World!";
tSize num_written_bytes = hdfsWrite(fs, writeFile, (void*)buffer, strlen(buffer)+1);
if (hdfsFlush(fs, writeFile)) {
fprintf(stderr, "Failed to 'flush' %s\n", writePath);
exit(-1);
}
hdfsCloseFile(fs, writeFile);
}
```
How To Link With The Library
----------------------------
See the CMake file for `test_libhdfs_ops.c` in the libhdfs source directory (`hadoop-hdfs-project/hadoop-hdfs/src/CMakeLists.txt`) or something like: `gcc above_sample.c -I$HADOOP_HDFS_HOME/include -L$HADOOP_HDFS_HOME/lib/native -lhdfs -o above_sample`
Common Problems
---------------
The most common problem is that the `CLASSPATH` is not set properly when calling a program that uses libhdfs. Make sure you set it to all the Hadoop jars needed to run Hadoop itself as well as the right configuration directory containing `hdfs-site.xml`. It is not valid to use wildcard syntax for specifying multiple jars. It may be useful to run `hadoop classpath --glob` or `hadoop classpath --jar <path>` to generate the correct classpath for your deployment. See [Hadoop Commands Reference](../hadoop-common/CommandsManual.html#classpath) for more information on this command.
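One possible way to build and run the sample program above, assuming a Linux host with the native library installed; the library paths are illustrative and vary by platform and JDK.

```bash
# Generate the full classpath, compile against libhdfs, and run the sample.
export CLASSPATH=$(hadoop classpath --glob)
gcc above_sample.c -I$HADOOP_HDFS_HOME/include -L$HADOOP_HDFS_HOME/lib/native -lhdfs -o above_sample
LD_LIBRARY_PATH=$HADOOP_HDFS_HOME/lib/native:$JAVA_HOME/jre/lib/amd64/server ./above_sample
```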
Thread Safe
-----------
libhdfs is thread safe.
* Concurrency and Hadoop FS "handles"
The Hadoop FS implementation includes a FS handle cache which
caches based on the URI of the namenode along with the user
connecting. So, all calls to `hdfsConnect` will return the same
handle but calls to `hdfsConnectAsUser` with different users will
return different handles. But, since HDFS client handles are
completely thread safe, this has no bearing on concurrency.
* Concurrency and libhdfs/JNI
The libhdfs calls to JNI should always be creating thread local
storage, so (in theory), libhdfs should be as thread safe as the
underlying calls to the Hadoop FS.
View File
@ -0,0 +1,157 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Synthetic Load Generator Guide
==============================
* [Synthetic Load Generator Guide](#Synthetic_Load_Generator_Guide)
* [Overview](#Overview)
* [Synopsis](#Synopsis)
* [Test Space Population](#Test_Space_Population)
* [Structure Generator](#Structure_Generator)
* [Data Generator](#Data_Generator)
Overview
--------
The synthetic load generator (SLG) is a tool for testing NameNode behavior under different client loads. The user can generate different mixes of read, write, and list requests by specifying the probabilities of read and write. The user controls the intensity of the load by adjusting parameters for the number of worker threads and the delay between operations. While load generators are running, the user can profile and monitor the NameNode. When a load generator exits, it prints NameNode statistics such as the average execution time of each kind of operation and the NameNode throughput.
Synopsis
--------
The synopsis of the command is:
java LoadGenerator [options]
Options include:
* `-readProbability` *read probability*
The probability of the read operation; default is 0.3333.
* `-writeProbability` *write probability*
The probability of the write operation; default is 0.3333.
* `-root` *test space root*
The root of the test space; default is /testLoadSpace.
* `-maxDelayBetweenOps` *maxDelayBetweenOpsInMillis*
The maximum delay between two consecutive operations in a thread;
default is 0 indicating no delay.
* `-numOfThreads` *numOfThreads*
The number of threads to spawn; default is 200.
* `-elapsedTime` *elapsedTimeInSecs*
The number of seconds that the program will run; a value of zero indicates that the program runs forever. The default value is 0.
* `-startTime` *startTimeInMillis*
The time at which all worker threads start to run. By default it is 10 seconds after the main program starts running. This creates a barrier if more than one load generator is running.
* `-seed` *seed*
The random generator seed for repeating requests to NameNode when
running with a single thread; default is the current time.
After command line argument parsing, the load generator traverses the test space and builds a table of all directories and another table of all files in the test space. It then waits until the start time to spawn the number of worker threads specified by the user. Each thread sends a stream of requests to the NameNode. At each iteration, it first decides whether it is going to read a file, create a file, or list a directory, following the read and write probabilities specified by the user. The listing probability is equal to 1 - read probability - write probability. When reading, it randomly picks a file in the test space and reads the entire file. When writing, it randomly picks a directory in the test space and creates a file there.
To prevent two threads in the same load generator, or threads from two different load generators, from creating the same file, the file name consists of the current machine's host name and the thread id. The length of the file follows a Gaussian distribution with an average size of 2 blocks and a standard deviation of 1. The new file is filled with the byte 'a'. To avoid the test space growing indefinitely, the file is deleted immediately after the file creation completes. When listing, it randomly picks a directory in the test space and lists its contents.
After an operation completes, the thread pauses for a random amount of time in the range of [0, maxDelayBetweenOps] if the specified maximum delay is not zero. All threads are stopped when the specified elapsed time has passed. Before exiting, the program prints the average execution time for each kind of NameNode operation, and the number of requests served by the NameNode per second.
Test Space Population
---------------------
The user needs to populate a test space before running a load generator. The structure generator generates a random test space structure and the data generator creates the files and directories of the test space in Hadoop distributed file system.
### Structure Generator
This tool generates a random namespace structure with the following constraints:
1. The number of subdirectories that a directory can have is a random
number in [minWidth, maxWidth].
2. The maximum depth of each subdirectory is a random number in [2\*maxDepth/3, maxDepth].
3. Files are randomly placed in leaf directories. The size of each file follows a Gaussian distribution with an average size of 1 block and a standard deviation of 1.
The generated namespace structure is described by two files in the output directory. Each line of the first file contains the full name of a leaf directory. Each line of the second file contains the full name of a file and its size, separated by a blank.
The synopsis of the command is:
java StructureGenerator [options]
Options include:
* `-maxDepth` *maxDepth*
Maximum depth of the directory tree; default is 5.
* `-minWidth` *minWidth*
Minimum number of subdirectories per directory; default is 1.
* `-maxWidth` *maxWidth*
Maximum number of subdirectories per directory; default is 5.
* `-numOfFiles` *\#OfFiles*
The total number of files in the test space; default is 10.
* `-avgFileSize` *avgFileSizeInBlocks*
Average file size in blocks; default is 1.
* `-outDir` *outDir*
Output directory; default is the current directory.
* `-seed` *seed*
Random number generator seed; default is the current time.
### Data Generator
This tool reads the directory structure and file structure from the input directory and creates the namespace in the Hadoop distributed file system. All files are filled with the byte 'a'.
The synopsis of the command is:
java DataGenerator [options]
Options include:
* `-inDir` *inDir*
Input directory name where directory/file structures are stored;
default is the current directory.
* `-root` *test space root*
The name of the root directory under which the new namespace will be placed; default is "/testLoadSpace".
View File
@ -0,0 +1,87 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
HDFS Short-Circuit Local Reads
==============================
* [HDFS Short-Circuit Local Reads](#HDFS_Short-Circuit_Local_Reads)
* [Short-Circuit Local Reads](#Short-Circuit_Local_Reads)
* [Background](#Background)
* [Setup](#Setup)
* [Example Configuration](#Example_Configuration)
* [Legacy HDFS Short-Circuit Local Reads](#Legacy_HDFS_Short-Circuit_Local_Reads)
Short-Circuit Local Reads
-------------------------
### Background
In `HDFS`, reads normally go through the `DataNode`. Thus, when the client asks the `DataNode` to read a file, the `DataNode` reads that file off of the disk and sends the data to the client over a TCP socket. So-called "short-circuit" reads bypass the `DataNode`, allowing the client to read the file directly. Obviously, this is only possible in cases where the client is co-located with the data. Short-circuit reads provide a substantial performance boost to many applications.
### Setup
To configure short-circuit local reads, you will need to enable `libhadoop.so`. See [Native Libraries](../hadoop-common/NativeLibraries.html) for details on enabling this library.
Short-circuit reads make use of a UNIX domain socket. This is a special path in the filesystem that allows the client and the `DataNode`s to communicate. You will need to set a path to this socket. The `DataNode` needs to be able to create this path. On the other hand, it should not be possible for any user except the HDFS user or root to create this path. For this reason, paths under `/var/run` or `/var/lib` are often used.
The client and the `DataNode` exchange information via a shared memory segment on `/dev/shm`.
Short-circuit local reads need to be configured on both the `DataNode` and the client.
### Example Configuration
Here is an example configuration.
```xml
<configuration>
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.domain.socket.path</name>
<value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
</configuration>
```
Legacy HDFS Short-Circuit Local Reads
-------------------------------------
The legacy implementation of short-circuit local reads, in which the client directly opens the HDFS block files, is still available for platforms other than Linux. Setting the value of `dfs.client.use.legacy.blockreader.local` in addition to `dfs.client.read.shortcircuit` to true enables this feature.
You also need to set the value of `dfs.datanode.data.dir.perm` to `750` instead of the default `700` and chmod/chown the directory tree under `dfs.datanode.data.dir` to be readable by the client and the `DataNode`. Use caution, because this means that the client can read all of the block files, bypassing HDFS permission checks.
Because legacy short-circuit local reads are insecure, access to this feature is limited to the users listed in the value of `dfs.block.local-path-access.user`.
```xml
<configuration>
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.client.use.legacy.blockreader.local</name>
<value>true</value>
</property>
<property>
<name>dfs.datanode.data.dir.perm</name>
<value>750</value>
</property>
<property>
<name>dfs.block.local-path-access.user</name>
<value>foo,bar</value>
</property>
</configuration>
```
View File
@ -0,0 +1,268 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Transparent Encryption in HDFS
==============================
* [Overview](#Overview)
* [Background](#Background)
* [Use Cases](#Use_Cases)
* [Architecture](#Architecture)
* [Overview](#Overview)
* [Accessing data within an encryption zone](#Accessing_data_within_an_encryption_zone)
* [Key Management Server, KeyProvider, EDEKs](#Key_Management_Server_KeyProvider_EDEKs)
* [Configuration](#Configuration)
* [Configuring the cluster KeyProvider](#Configuring_the_cluster_KeyProvider)
* [Selecting an encryption algorithm and codec](#Selecting_an_encryption_algorithm_and_codec)
* [Namenode configuration](#Namenode_configuration)
* [crypto command-line interface](#crypto_command-line_interface)
* [createZone](#createZone)
* [listZones](#listZones)
* [Example usage](#Example_usage)
* [Distcp considerations](#Distcp_considerations)
* [Running as the superuser](#Running_as_the_superuser)
* [Copying between encrypted and unencrypted locations](#Copying_between_encrypted_and_unencrypted_locations)
* [Attack vectors](#Attack_vectors)
* [Hardware access exploits](#Hardware_access_exploits)
* [Root access exploits](#Root_access_exploits)
* [HDFS admin exploits](#HDFS_admin_exploits)
* [Rogue user exploits](#Rogue_user_exploits)
Overview
--------
HDFS implements *transparent*, *end-to-end* encryption. Once configured, data read from and written to special HDFS directories is *transparently* encrypted and decrypted without requiring changes to user application code. This encryption is also *end-to-end*, which means the data can only be encrypted and decrypted by the client. HDFS never stores or has access to unencrypted data or unencrypted data encryption keys. This satisfies two typical requirements for encryption: *at-rest encryption* (meaning data on persistent media, such as a disk) as well as *in-transit encryption* (e.g. when data is travelling over the network).
Background
----------
Encryption can be done at different layers in a traditional data management software/hardware stack. Choosing to encrypt at a given layer comes with different advantages and disadvantages.
* **Application-level encryption**. This is the most secure and most flexible approach. The application has ultimate control over what is encrypted and can precisely reflect the requirements of the user. However, writing applications to do this is hard. This is also not an option for customers of existing applications that do not support encryption.
* **Database-level encryption**. Similar to application-level encryption in terms of its properties. Most database vendors offer some form of encryption. However, there can be performance issues. One example is that indexes cannot be encrypted.
* **Filesystem-level encryption**. This option offers high performance, application transparency, and is typically easy to deploy. However, it is unable to model some application-level policies. For instance, multi-tenant applications might want to encrypt based on the end user. A database might want different encryption settings for each column stored within a single file.
* **Disk-level encryption**. Easy to deploy and high performance, but also quite inflexible. Only really protects against physical theft.
HDFS-level encryption fits between database-level and filesystem-level encryption in this stack. This has a lot of positive effects. HDFS encryption is able to provide good performance and existing Hadoop applications are able to run transparently on encrypted data. HDFS also has more context than traditional filesystems when it comes to making policy decisions.
HDFS-level encryption also prevents attacks at the filesystem-level and below (so-called "OS-level attacks"). The operating system and disk only interact with encrypted bytes, since the data is already encrypted by HDFS.
Use Cases
---------
Data encryption is required by a number of different government, financial, and regulatory entities. For example, the health-care industry has HIPAA regulations, the card payment industry has PCI DSS regulations, and the US government has FISMA regulations. Having transparent encryption built into HDFS makes it easier for organizations to comply with these regulations.
Encryption can also be performed at the application-level, but by integrating it into HDFS, existing applications can operate on encrypted data without changes. This integrated architecture implies stronger encrypted file semantics and better coordination with other HDFS functions.
Architecture
------------
### Overview
For transparent encryption, we introduce a new abstraction to HDFS: the *encryption zone*. An encryption zone is a special directory whose contents will be transparently encrypted upon write and transparently decrypted upon read. Each encryption zone is associated with a single *encryption zone key* which is specified when the zone is created. Each file within an encryption zone has its own unique *data encryption key (DEK)*. DEKs are never handled directly by HDFS. Instead, HDFS only ever handles an *encrypted data encryption key (EDEK)*. Clients decrypt an EDEK, and then use the resulting DEK to read and write data. HDFS datanodes simply see a stream of encrypted bytes.
A new cluster service is required to manage encryption keys: the Hadoop Key Management Server (KMS). In the context of HDFS encryption, the KMS performs three basic responsibilities:
1. Providing access to stored encryption zone keys
2. Generating new encrypted data encryption keys for storage on the NameNode
3. Decrypting encrypted data encryption keys for use by HDFS clients
The KMS will be described in more detail below.
### Accessing data within an encryption zone
When creating a new file in an encryption zone, the NameNode asks the KMS to generate a new EDEK encrypted with the encryption zone's key. The EDEK is then stored persistently as part of the file's metadata on the NameNode.
When reading a file within an encryption zone, the NameNode provides the client with the file's EDEK and the encryption zone key version used to encrypt the EDEK. The client then asks the KMS to decrypt the EDEK, which involves checking that the client has permission to access the encryption zone key version. Assuming that is successful, the client uses the DEK to decrypt the file's contents.
All of the above steps for the read and write path happen automatically through interactions between the DFSClient, the NameNode, and the KMS.
Access to encrypted file data and metadata is controlled by normal HDFS filesystem permissions. This means that if HDFS is compromised (for example, by gaining unauthorized access to an HDFS superuser account), a malicious user only gains access to ciphertext and encrypted keys. However, since access to encryption zone keys is controlled by a separate set of permissions on the KMS and key store, this does not pose a security threat.
### Key Management Server, KeyProvider, EDEKs
The KMS is a proxy that interfaces with a backing key store on behalf of HDFS daemons and clients. Both the backing key store and the KMS implement the Hadoop KeyProvider API. See the [KMS documentation](../../hadoop-kms/index.html) for more information.
In the KeyProvider API, each encryption key has a unique *key name*. Because keys can be rolled, a key can have multiple *key versions*, where each key version has its own *key material* (the actual secret bytes used during encryption and decryption). An encryption key can be fetched by either its key name, returning the latest version of the key, or by a specific key version.
The KMS implements additional functionality which enables creation and decryption of *encrypted encryption keys (EEKs)*. Creation and decryption of EEKs happens entirely on the KMS. Importantly, the client requesting creation or decryption of an EEK never handles the EEK's encryption key. To create a new EEK, the KMS generates a new random key, encrypts it with the specified key, and returns the EEK to the client. To decrypt an EEK, the KMS checks that the user has access to the encryption key, uses it to decrypt the EEK, and returns the decrypted encryption key.
In the context of HDFS encryption, EEKs are *encrypted data encryption keys (EDEKs)*, where a *data encryption key (DEK)* is what is used to encrypt and decrypt file data. Typically, the key store is configured to only allow end users access to the keys used to encrypt DEKs. This means that EDEKs can be safely stored and handled by HDFS, since the HDFS user will not have access to unencrypted encryption keys.
Configuration
-------------
A necessary prerequisite is an instance of the KMS, as well as a backing key store for the KMS. See the [KMS documentation](../../hadoop-kms/index.html) for more information.
Once a KMS has been set up and the NameNode and HDFS clients have been correctly configured, an admin can use the `hadoop key` and `hdfs crypto` command-line tools to create encryption keys and set up new encryption zones. Existing data can be encrypted by copying it into the new encryption zones using tools like distcp.
### Configuring the cluster KeyProvider
#### dfs.encryption.key.provider.uri
The KeyProvider to use when interacting with encryption keys used when reading and writing to an encryption zone.
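As a minimal sketch rather than an authoritative example, this is typically set in `hdfs-site.xml` on the NameNode and on HDFS clients; the KMS address `kms.example.com:16000` below is a placeholder and should be replaced with the URI of your own KMS.

```xml
<configuration>
  <property>
    <name>dfs.encryption.key.provider.uri</name>
    <!-- Placeholder KMS endpoint; substitute the address of your own KMS. -->
    <value>kms://http@kms.example.com:16000/kms</value>
  </property>
</configuration>
```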
### Selecting an encryption algorithm and codec
#### hadoop.security.crypto.codec.classes.EXAMPLECIPHERSUITE
The prefix for a given crypto codec; it contains a comma-separated list of implementation classes for that codec (e.g. EXAMPLECIPHERSUITE). The first available implementation will be used; the others are fallbacks.
#### hadoop.security.crypto.codec.classes.aes.ctr.nopadding
Default: `org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec,org.apache.hadoop.crypto.JceAesCtrCryptoCodec`
Comma-separated list of crypto codec implementations for AES/CTR/NoPadding. The first available implementation will be used; the others are fallbacks.
#### hadoop.security.crypto.cipher.suite
Default: `AES/CTR/NoPadding`
Cipher suite for crypto codec.
#### hadoop.security.crypto.jce.provider
Default: None
The JCE provider name used in CryptoCodec.
#### hadoop.security.crypto.buffer.size
Default: `8192`
The buffer size used by CryptoInputStream and CryptoOutputStream.
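Putting the above together, a minimal sketch of explicit codec settings, using the default values listed above, might look like the following (these properties typically live in `core-site.xml`):

```xml
<configuration>
  <property>
    <name>hadoop.security.crypto.cipher.suite</name>
    <value>AES/CTR/NoPadding</value>
  </property>
  <property>
    <name>hadoop.security.crypto.codec.classes.aes.ctr.nopadding</name>
    <!-- The OpenSSL-based codec is preferred; the JCE codec is the fallback. -->
    <value>org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec,org.apache.hadoop.crypto.JceAesCtrCryptoCodec</value>
  </property>
  <property>
    <name>hadoop.security.crypto.buffer.size</name>
    <value>8192</value>
  </property>
</configuration>
```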
### Namenode configuration
#### dfs.namenode.list.encryption.zones.num.responses
Default: `100`
When listing encryption zones, the maximum number of zones that will be returned in a batch. Fetching the list incrementally in batches improves namenode performance.
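For illustration only, and assuming it is set in `hdfs-site.xml` on the NameNode, the default shown above can be made explicit like this:

```xml
<property>
  <name>dfs.namenode.list.encryption.zones.num.responses</name>
  <value>100</value>
</property>
```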
`crypto` command-line interface
-------------------------------
### createZone
Usage: `[-createZone -keyName <keyName> -path <path>]`
Create a new encryption zone.
| Argument | Description |
|:---- |:---- |
| *path* | The path of the encryption zone to create. It must be an empty directory. |
| *keyName* | Name of the key to use for the encryption zone. |
### listZones
Usage: `[-listZones]`
List all encryption zones. Requires superuser permissions.
Example usage
-------------
These instructions assume that you are running as the normal user or the HDFS superuser, as appropriate. Use `sudo` as needed for your environment.
# As the normal user, create a new encryption key
hadoop key create myKey
# As the super user, create a new empty directory and make it an encryption zone
hadoop fs -mkdir /zone
hdfs crypto -createZone -keyName myKey -path /zone
# chown it to the normal user
hadoop fs -chown myuser:myuser /zone
# As the normal user, put a file in, read it out
hadoop fs -put helloWorld /zone
hadoop fs -cat /zone/helloWorld
Distcp considerations
---------------------
### Running as the superuser
One common use case for distcp is to replicate data between clusters for backup and disaster recovery purposes. This is typically performed by the cluster administrator, who is an HDFS superuser.
To enable this same workflow when using HDFS encryption, we introduced a new virtual path prefix, `/.reserved/raw/`, that gives superusers direct access to the underlying block data in the filesystem. This allows superusers to distcp data without needing access to encryption keys, and also avoids the overhead of decrypting and re-encrypting data. It also means the source and destination data will be byte-for-byte identical, which would not be true if the data were being re-encrypted with a new EDEK.
When using `/.reserved/raw` to distcp encrypted data, it's important to preserve extended attributes with the [-px](#a-px) flag. This is because encrypted file attributes (such as the EDEK) are exposed through extended attributes within `/.reserved/raw`, and must be preserved to be able to decrypt the file. This means that if the distcp is initiated at or above the encryption zone root, it will automatically create an encryption zone at the destination if it does not already exist. However, it's still recommended that the admin first create identical encryption zones on the destination cluster to avoid any potential mishaps.
### Copying between encrypted and unencrypted locations
By default, distcp compares checksums provided by the filesystem to verify that the data was successfully copied to the destination. When copying between an unencrypted and encrypted location, the filesystem checksums will not match since the underlying block data is different. In this case, specify the [-skipcrccheck](#a-skipcrccheck) and [-update](#a-update) distcp flags to avoid verifying checksums.
Attack vectors
--------------
### Hardware access exploits
These exploits assume that the attacker has gained physical access to hard drives from cluster machines, i.e. datanodes and namenodes.
1. Access to swap files of processes containing data encryption keys.
* By itself, this does not expose cleartext, as it also requires access to encrypted block files.
* This can be mitigated by disabling swap, using encrypted swap, or using mlock to prevent keys from being swapped out.
2. Access to encrypted block files.
* By itself, this does not expose cleartext, as it also requires access to DEKs.
### Root access exploits
These exploits assume that the attacker has gained root shell access to cluster machines, i.e. datanodes and namenodes. Many of these exploits cannot be addressed in HDFS, since a malicious root user has access to the in-memory state of processes holding encryption keys and cleartext. For these exploits, the only mitigation technique is carefully restricting and monitoring root shell access.
1. Access to encrypted block files.
* By itself, this does not expose cleartext, as it also requires access to encryption keys.
2. Dump memory of client processes to obtain DEKs, delegation tokens, cleartext.
* No mitigation.
3. Recording network traffic to sniff encryption keys and encrypted data in transit.
* By itself, insufficient to read cleartext without the EDEK encryption key.
4. Dump memory of datanode process to obtain encrypted block data.
* By itself, insufficient to read cleartext without the DEK.
5. Dump memory of namenode process to obtain encrypted data encryption keys.
* By itself, insufficient to read cleartext without the EDEK's encryption key and encrypted block files.
### HDFS admin exploits
These exploits assume that the attacker has compromised HDFS, but does not have root or `hdfs` user shell access.
1. Access to encrypted block files.
* By itself, insufficient to read cleartext without the EDEK and EDEK encryption key.
2. Access to encryption zone and encrypted file metadata (including encrypted data encryption keys), via -fetchImage.
* By itself, insufficient to read cleartext without EDEK encryption keys.
### Rogue user exploits
A rogue user can collect keys of files they have access to, and use them later to decrypt the encrypted data of those files. As the user had access to those files, they already had access to the file contents. This can be mitigated through periodic key rolling policies.
View File
@ -0,0 +1,242 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
ViewFs Guide
============
* [ViewFs Guide](#ViewFs_Guide)
* [Introduction](#Introduction)
* [The Old World (Prior to Federation)](#The_Old_World_Prior_to_Federation)
* [Single Namenode Clusters](#Single_Namenode_Clusters)
* [Pathnames Usage Patterns](#Pathnames_Usage_Patterns)
* [Pathname Usage Best Practices](#Pathname_Usage_Best_Practices)
* [New World Federation and ViewFs](#New_World__Federation_and_ViewFs)
* [How The Clusters Look](#How_The_Clusters_Look)
* [A Global Namespace Per Cluster Using ViewFs](#A_Global_Namespace_Per_Cluster_Using_ViewFs)
* [Pathname Usage Patterns](#Pathname_Usage_Patterns)
* [Pathname Usage Best Practices](#Pathname_Usage_Best_Practices)
* [Renaming Pathnames Across Namespaces](#Renaming_Pathnames_Across_Namespaces)
* [FAQ](#FAQ)
* [Appendix: A Mount Table Configuration Example](#Appendix:_A_Mount_Table_Configuration_Example)
Introduction
------------
The View File System (ViewFs) provides a way to manage multiple Hadoop file system namespaces (or namespace volumes). It is particularly useful for clusters having multiple namenodes, and hence multiple namespaces, in [HDFS Federation](./Federation.html). ViewFs is analogous to *client side mount tables* in some Unix/Linux systems. ViewFs can be used to create personalized namespace views and also per-cluster common views.
This guide is presented in the context of Hadoop systems that have several clusters, each of which may be federated into multiple namespaces. It also describes how to use ViewFs in federated HDFS to provide a per-cluster global namespace so that applications can operate in a way similar to the pre-federation world.
The Old World (Prior to Federation)
-----------------------------------
### Single Namenode Clusters
In the old world prior to [HDFS Federation](./Federation.html), a cluster has a single namenode which provides a single file system namespace for that cluster. Suppose there are multiple clusters. The file system namespaces of each cluster are completely independent and disjoint. Furthermore, physical storage is NOT shared across clusters (i.e. the Datanodes are not shared across clusters.)
The `core-site.xml` of each cluster has a configuration property that sets the default file system to the namenode of that cluster:
```xml
<property>
<name>fs.default.name</name>
<value>hdfs://namenodeOfClusterX:port</value>
</property>
```
Such a configuration property allows one to use slash-relative names to resolve paths relative to the cluster namenode. For example, the path `/foo/bar` refers to `hdfs://namenodeOfClusterX:port/foo/bar` using the above configuration.
This configuration property is set on each gateway of the clusters and also on key services of that cluster such as the JobTracker and Oozie.
### Pathnames Usage Patterns
Hence on Cluster X where the `core-site.xml` is set as above, the typical pathnames are
1. `/foo/bar`
* This is equivalent to `hdfs://namenodeOfClusterX:port/foo/bar` as before.
2. `hdfs://namenodeOfClusterX:port/foo/bar`
* While this is a valid pathname, one is better off using `/foo/bar`, as it allows the application and its data to be transparently moved to another cluster when needed.
3. `hdfs://namenodeOfClusterY:port/foo/bar`
* This is a URI for referring to a pathname on another cluster, such as Cluster Y. In particular, the command for copying files from cluster Y to Cluster Z looks like:
distcp hdfs://namenodeClusterY:port/pathSrc hdfs://namenodeClusterZ:port/pathDest
4. `webhdfs://namenodeClusterX:http_port/foo/bar` and `hftp://namenodeClusterX:http_port/foo/bar`
* These are file system URIs respectively for accessing files via the WebHDFS file system and the HFTP file system. Note that WebHDFS and HFTP use the HTTP port of the namenode but not the RPC port.
5. `http://namenodeClusterX:http_port/webhdfs/v1/foo/bar` and `http://proxyClusterX:http_port/foo/bar`
* These are HTTP URLs respectively for accessing files via [WebHDFS REST API](./WebHDFS.html) and HDFS proxy.
### Pathname Usage Best Practices
When one is within a cluster, it is recommended to use the pathname of type (1) above instead of a fully qualified URI like (2). Fully qualified URIs are similar to addresses and do not allow the application to move along with its data.
New World Federation and ViewFs
---------------------------------
### How The Clusters Look
Suppose there are multiple clusters. Each cluster has one or more namenodes. Each namenode has its own namespace. A namenode belongs to one and only one cluster. The namenodes in the same cluster share the physical storage of that cluster. The namespaces across clusters are independent as before.
Operations decide what is stored on each namenode within a cluster based on the storage needs. For example, they may put all the user data (`/user/<username>`) in one namenode, all the feed-data (`/data`) in another namenode, all the projects (`/projects`) in yet another namenode, etc.
### A Global Namespace Per Cluster Using ViewFs
In order to provide transparency with the old world, the ViewFs file system (i.e. a client-side mount table) is used to create, for each cluster, an independent namespace view that is similar to the namespace in the old world. The client-side mount tables, like Unix mount tables, mount the new namespace volumes using the old naming convention. The following figure shows a mount table mounting four namespace volumes: `/user`, `/data`, `/projects`, and `/tmp`:
![Typical Mount Table for each Cluster](./images/viewfs_TypicalMountTable.png)
ViewFs implements the Hadoop file system interface just like HDFS and the local file system. It is a trivial file system in the sense that it only allows linking to other file systems. Because ViewFs implements the Hadoop file system interface, it works transparently with Hadoop tools. For example, all the shell commands work with ViewFs just as they do with HDFS and the local file system.
The mount points of a mount table are specified in the standard Hadoop configuration files. In the configuration of each cluster, the default file system is set to the mount table for that cluster as shown below (compare it with the configuration in [Single Namenode Clusters](#Single_Namenode_Clusters)).
```xml
<property>
<name>fs.default.name</name>
<value>viewfs://clusterX</value>
</property>
```
The authority following the `viewfs://` scheme in the URI is the mount table name. It is recommended that the mount table of a cluster be named after the cluster. The Hadoop system will then look for a mount table with the name "clusterX" in the Hadoop configuration files. Operations arranges for all gateways and service machines to contain the mount tables for ALL clusters such that, for each cluster, the default file system is set to the ViewFs mount table for that cluster, as described above.
### Pathname Usage Patterns
Hence on Cluster X, where the `core-site.xml` is set to make the default file system use the mount table of that cluster, the typical pathnames are
1. `/foo/bar`
* This is equivalent to `viewfs://clusterX/foo/bar`. If such a pathname was used in the old non-federated world, then the transition to the federated world is transparent.
2. `viewfs://clusterX/foo/bar`
* While this is a valid pathname, one is better off using `/foo/bar`, as it allows the application and its data to be transparently moved to another cluster when needed.
3. `viewfs://clusterY/foo/bar`
* This is a URI for referring to a pathname on another cluster, such as Cluster Y. In particular, the command for copying files from cluster Y to Cluster Z looks like:
distcp viewfs://clusterY/pathSrc viewfs://clusterZ/pathDest
4. `viewfs://clusterX-webhdfs/foo/bar` and `viewfs://clusterX-hftp/foo/bar`
* These are URIs respectively for accessing files via the WebHDFS file system and the HFTP file system.
5. `http://namenodeClusterX:http_port/webhdfs/v1/foo/bar` and `http://proxyClusterX:http_port/foo/bar`
* These are HTTP URLs respectively for accessing files via [WebHDFS REST API](./WebHDFS.html) and HDFS proxy. Note that they are the same as before.
### Pathname Usage Best Practices
When one is within a cluster, it is recommended to use the pathname of type (1) above instead of a fully qualified URI like (2). Further, applications should not rely on knowledge of the mount points and use a path like `hdfs://namenodeContainingUserDirs:port/joe/foo/bar` to refer to a file in a particular namenode; one should use `/user/joe/foo/bar` instead.
### Renaming Pathnames Across Namespaces
Recall that one cannot rename files or directories across namenodes or clusters in the old world. The same is true in the new world but with an additional twist. For example, in the old world one can perform the command below.
rename /user/joe/myStuff /data/foo/bar
This will NOT work in the new world if `/user` and `/data` are actually stored on different namenodes within a cluster.
### FAQ
1. **As I move from a non-federated world to the federated world, I will have to keep track of namenodes for different volumes; how do I do that?**
No, you won't. See the examples above: you are either using a relative name and taking advantage of the default file system, or changing your path from `hdfs://namenodeClusterX/foo/bar` to `viewfs://clusterX/foo/bar`.
2. **What happens if Operations moves some files from one namenode to another namenode within a cluster?**
Operations may move files from one namenode to another in order to deal with storage capacity issues. They will do this in a way that avoids breaking applications. Let's take some examples.
* Example 1: `/user` and `/data` were on one namenode, but later they need to be on separate namenodes to deal with capacity issues. Indeed, Operations would have created separate mount points for `/user` and `/data`. Prior to the change, the mounts for `/user` and `/data` would have pointed to the same namenode, say `namenodeContainingUserAndData`. Operations will update the mount tables so that the mount points are changed to `namenodeContainingUser` and `namenodeContainingData`, respectively.
* Example 2: All projects fit on one namenode, but later they need two or more namenodes. ViewFs allows mounts like `/project/foo` and `/project/bar`, which allows the mount tables to be updated to point to the corresponding namenodes.
3. **Is the mount table in each** `core-site.xml` **or in a separate file of its own?**
The plan is to keep the mount tables in separate files and have the `core-site.xml` [xinclude](http://www.w3.org/2001/XInclude) them. While one can keep these files on each machine locally, it is better to use HTTP to access them from a central location.
4. **Should the configuration have the mount table definitions for only one cluster or all clusters?**
The configuration should have the mount definitions for all clusters since one needs to have access to data in other clusters such as with distcp.
5. **When is the mount table actually read given that Operations may change a mount table over time?**
The mount table is read when the job is submitted to the cluster. The `XInclude` in `core-site.xml` is expanded at job submission time. This means that if the mount tables are changed then the jobs need to be resubmitted. For this reason, we want to implement merge-mounts, which will greatly reduce the need to change mount tables. Further, in the future we would like to read the mount tables via another mechanism that is initialized at job start time.
6. **Will the JobTracker (or YARN's Resource Manager) itself use ViewFs?**
No, it does not need to. Neither does the NodeManager.
7. **Does ViewFs allow only mounts at the top level?**
No; it is more general. For example, one can mount `/user/joe` and `/user/jane`. In this case, an internal read-only directory is created for `/user` in the mount table. All operations on `/user` are valid except that `/user` is read-only.
8. **An application works across the clusters and needs to persistently store file paths. Which paths should it store?**
You should store `viewfs://cluster/path` type path names, the same as you use when running applications. This insulates you from movement of data across namenodes inside a cluster, as long as Operations does the moves in a transparent fashion. It does not insulate you if data gets moved from one cluster to another; the older (pre-federation) world did not protect you from such data movements across clusters anyway.
9. **What about delegation tokens?**
Delegation tokens for the cluster to which you are submitting the job (including all mounted volumes for that cluster's mount table), and for the input and output paths of your map-reduce job (including all volumes mounted via mount tables for the specified input and output paths), are all handled automatically. In addition, there is a way to add additional delegation tokens to the base cluster configuration for special circumstances.
Appendix: A Mount Table Configuration Example
---------------------------------------------
Generally, users do not have to define mount tables or the `core-site.xml` to use the mount table. This is done by operations and the correct configuration is set on the right gateway machines as is done for `core-site.xml` today.
The mount tables can be described in `core-site.xml` but it is better to use indirection in `core-site.xml` to reference a separate configuration file, say `mountTable.xml`. Add the following configuration element to `core-site.xml` for referencing `mountTable.xml`:
```xml
<configuration xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include href="mountTable.xml" />
</configuration>
```
In the file `mountTable.xml`, there is a definition of the mount table "ClusterX" for the hypothetical cluster that is a federation of the three namespace volumes managed by the three namenodes
1. nn1-clusterx.example.com:8020,
2. nn2-clusterx.example.com:8020, and
3. nn3-clusterx.example.com:8020.
Here `/home` and `/tmp` are in the namespace managed by namenode nn1-clusterx.example.com:8020, and the projects `/foo` and `/bar` are hosted on the other namenodes of the federated cluster. The home directory base path is set to `/home` so that each user can access their home directory using the `getHomeDirectory()` method defined in [FileSystem](../../api/org/apache/hadoop/fs/FileSystem.html)/[FileContext](../../api/org/apache/hadoop/fs/FileContext.html).
```xml
<configuration>
<property>
<name>fs.viewfs.mounttable.ClusterX.homedir</name>
<value>/home</value>
</property>
<property>
<name>fs.viewfs.mounttable.ClusterX.link./home</name>
<value>hdfs://nn1-clusterx.example.com:8020/home</value>
</property>
<property>
<name>fs.viewfs.mounttable.ClusterX.link./tmp</name>
<value>hdfs://nn1-clusterx.example.com:8020/tmp</value>
</property>
<property>
<name>fs.viewfs.mounttable.ClusterX.link./projects/foo</name>
<value>hdfs://nn2-clusterx.example.com:8020/projects/foo</value>
</property>
<property>
<name>fs.viewfs.mounttable.ClusterX.link./projects/bar</name>
<value>hdfs://nn3-clusterx.example.com:8020/projects/bar</value>
</property>
</configuration>
```
File diff suppressed because it is too large