diff --git a/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt b/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt index 9176ec7ecca..c9bee1a5b7d 100644 --- a/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt +++ b/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt @@ -290,6 +290,9 @@ Trunk (Unreleased) HADOOP-11484. hadoop-mapreduce-client-nativetask fails to build on ARM AARCH64 due to x86 asm statements (Edward Nevill via Colin P. McCabe) + HDFS-7667. Various typos and improvements to HDFS Federation doc + (Charles Lamb via aw) + Release 2.7.0 - UNRELEASED INCOMPATIBLE CHANGES diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/Federation.apt.vm b/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/Federation.apt.vm index 29278b7197a..17aaf3ce597 100644 --- a/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/Federation.apt.vm +++ b/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/Federation.apt.vm @@ -32,16 +32,16 @@ HDFS Federation * <> - * Consists of directories, files and blocks + * Consists of directories, files and blocks. * It supports all the namespace related file system operations such as create, delete, modify and list files and directories. - * <> has two parts + * <>, which has two parts: - * Block Management (which is done in Namenode) + * Block Management (performed in the Namenode) - * Provides datanode cluster membership by handling registrations, and + * Provides Datanode cluster membership by handling registrations, and periodic heart beats. * Processes block reports and maintains location of blocks. @@ -49,29 +49,29 @@ HDFS Federation * Supports block related operations such as create, delete, modify and get block location. - * Manages replica placement and replication of a block for under - replicated blocks and deletes blocks that are over replicated. + * Manages replica placement, replication of under + replicated blocks, and deletion of over replicated blocks.
- * Storage - is provided by datanodes by storing blocks on the local file - system and allows read/write access. + * Storage - provided by Datanodes, which store blocks on the local file + system and allow read/write access. The prior HDFS architecture allows only a single namespace for the - entire cluster. A single Namenode manages this namespace. HDFS - Federation addresses limitation of the prior architecture by adding - support multiple Namenodes/namespaces to HDFS file system. + entire cluster. In that configuration, a single Namenode manages the + namespace. HDFS Federation addresses this limitation by adding + support for multiple Namenodes/namespaces to HDFS. * {Multiple Namenodes/Namespaces} In order to scale the name service horizontally, federation uses multiple - independent Namenodes/namespaces. The Namenodes are federated, that is, the + independent Namenodes/namespaces. The Namenodes are federated; the Namenodes are independent and do not require coordination with each other. - The datanodes are used as common storage for blocks by all the Namenodes. - Each datanode registers with all the Namenodes in the cluster. Datanodes - send periodic heartbeats and block reports and handles commands from the - Namenodes. + The Datanodes are used as common storage for blocks by all the Namenodes. + Each Datanode registers with all the Namenodes in the cluster. Datanodes + send periodic heartbeats and block reports. They also handle + commands from the Namenodes. - Users may use {{{./ViewFs.html}ViewFs}} to create personalized namespace views, - where ViewFs is analogous to client side mount tables in some Unix/Linux systems. + Users may use {{{./ViewFs.html}ViewFs}} to create personalized namespace views. + ViewFs is analogous to client-side mount tables in some Unix/Linux systems. [./images/federation.gif] HDFS Federation Architecture @@ -79,66 +79,67 @@ HDFS Federation <> A Block Pool is a set of blocks that belong to a single namespace.
- Datanodes store blocks for all the block pools in the cluster. - It is managed independently of other block pools. This allows a namespace - to generate Block IDs for new blocks without the need for coordination - with the other namespaces. The failure of a Namenode does not prevent - the datanode from serving other Namenodes in the cluster. + Datanodes store blocks for all the block pools in the cluster. Each + Block Pool is managed independently. This allows a namespace to + generate Block IDs for new blocks without the need for coordination + with the other namespaces. A Namenode failure does not prevent the + Datanode from serving other Namenodes in the cluster. A Namespace and its block pool together are called Namespace Volume. It is a self-contained unit of management. When a Namenode/namespace - is deleted, the corresponding block pool at the datanodes is deleted. + is deleted, the corresponding block pool at the Datanodes is deleted. Each namespace volume is upgraded as a unit, during cluster upgrade. <> - A new identifier <> is added to identify all the nodes in - the cluster. When a Namenode is formatted, this identifier is provided - or auto generated. This ID should be used for formatting the other - Namenodes into the cluster. + A <> identifier is used to identify all the nodes in the + cluster. When a Namenode is formatted, this identifier is either + provided or auto generated. This ID should be used for formatting + the other Namenodes into the cluster. ** Key Benefits - * Namespace Scalability - HDFS cluster storage scales horizontally but - the namespace does not. Large deployments or deployments using lot - of small files benefit from scaling the namespace by adding more - Namenodes to the cluster + * Namespace Scalability - Federation adds namespace horizontal + scaling. Large deployments or deployments with a lot of small files + benefit from namespace scaling by allowing more Namenodes to be + added to the cluster.
- * Performance - File system operation throughput is limited by a single - Namenode in the prior architecture. Adding more Namenodes to the cluster - scales the file system read/write operations throughput. + * Performance - File system throughput is not limited by a single + Namenode. Adding more Namenodes to the cluster scales the file + system read/write throughput. - * Isolation - A single Namenode offers no isolation in multi user - environment. An experimental application can overload the Namenode - and slow down production critical applications. With multiple Namenodes, - different categories of applications and users can be isolated to - different namespaces. + * Isolation - A single Namenode offers no isolation in a multi user + environment. For example, an experimental application can overload + the Namenode and slow down production critical applications. By using + multiple Namenodes, different categories of applications and users + can be isolated to different namespaces. * {Federation Configuration} - Federation configuration is <> and allows existing - single Namenode configuration to work without any change. The new - configuration is designed such that all the nodes in the cluster have - same configuration without the need for deploying different configuration - based on the type of the node in the cluster. + Federation configuration is <> and allows + existing single Namenode configurations to work without any + change. The new configuration is designed such that all the nodes in + the cluster have the same configuration without the need for + deploying different configurations based on the type of the node in + the cluster. - A new abstraction called <<>> is added with - federation. The Namenode and its corresponding secondary/backup/checkpointer - nodes belong to this. 
To support single configuration file, the Namenode and - secondary/backup/checkpointer configuration parameters are suffixed with - <<>> and are added to the same configuration file. + Federation adds a new <<>> abstraction. A Namenode + and its corresponding secondary/backup/checkpointer nodes all belong + to a NameServiceId. In order to support a single configuration file, + the Namenode and secondary/backup/checkpointer configuration + parameters are suffixed with the <<>>. ** Configuration: - <>: Add the following parameters to your configuration: - <<>>: Configure with list of comma separated - NameServiceIDs. This will be used by Datanodes to determine all the + <>: Add the <<>> parameter to your + configuration and configure it with a list of comma separated + NameServiceIDs. This will be used by the Datanodes to determine the Namenodes in the cluster. <>: For each Namenode and Secondary Namenode/BackupNode/Checkpointer - add the following configuration suffixed with the corresponding - <<>> into the common configuration file. + add the following configuration parameters suffixed with the corresponding + <<>> into the common configuration file: *---------------------+--------------------------------------------+ || Daemon || Configuration Parameter | *---------------------+--------------------------------------------+ @@ -160,7 +161,7 @@ HDFS Federation | | <<>> | *---------------------+--------------------------------------------+ - Here is an example configuration with two namenodes: + Here is an example configuration with two Namenodes: ---- @@ -199,16 +200,16 @@ HDFS Federation ** Formatting Namenodes - <>: Format a namenode using the following command: + <>: Format a Namenode using the following command: ---- [hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -format [-clusterId ] ---- - Choose a unique cluster_id, which will not conflict other clusters in - your environment. If it is not provided, then a unique ClusterID is + Choose a unique cluster_id which will not conflict with other clusters in + your environment.
If a cluster_id is not provided, then a unique one is auto generated. - <>: Format additional namenode using the following command: + <>: Format additional Namenodes using the following command: ---- [hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -format -clusterId @@ -219,40 +220,38 @@ HDFS Federation ** Upgrading from an older release and configuring federation - Older releases supported a single Namenode. - Upgrade the cluster to newer release to enable federation + Older releases only support a single Namenode. + Upgrade the cluster to a newer release in order to enable federation. During upgrade you can provide a ClusterID as follows: ---- -[hdfs]$ $HADOOP_PREFIX/bin/hdfs start namenode --config $HADOOP_CONF_DIR -upgrade -clusterId +[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon start namenode -upgrade -clusterId ---- - If ClusterID is not provided, it is auto generated. + If cluster_id is not provided, it is auto generated. ** Adding a new Namenode to an existing HDFS cluster - Follow the following steps: + Perform the following steps: - * Add configuration parameter <<>> to the configuration. + * Add <<>> to the configuration. - * Update the configuration with NameServiceID suffix. Configuration - key names have changed post release 0.20. You must use new configuration - parameter names, for federation. + * Update the configuration with the NameServiceID suffix. Configuration + key names changed post release 0.20. You must use the new configuration + parameter names in order to use federation. - * Add new Namenode related config to the configuration files. + * Add the new Namenode related config to the configuration file. * Propagate the configuration file to the all the nodes in the cluster. - * Start the new Namenode, Secondary/Backup. + * Start the new Namenode and Secondary/Backup.
- * Refresh the datanodes to pickup the newly added Namenode by running - the following command: + * Refresh the Datanodes to pick up the newly added Namenode by running + the following command against all the Datanodes in the cluster: ---- -[hdfs]$ $HADOOP_PREFIX/bin/hdfs dfadmin -refreshNameNode : +[hdfs]$ $HADOOP_PREFIX/bin/hdfs dfsadmin -refreshNameNode : ---- - * The above command must be run against all the datanodes in the cluster. - * {Managing the cluster} ** Starting and stopping cluster @@ -270,28 +269,28 @@ HDFS Federation ---- These commands can be run from any node where the HDFS configuration is - available. The command uses configuration to determine the Namenodes - in the cluster and starts the Namenode process on those nodes. The - datanodes are started on nodes specified in the <<>> file. The - script can be used as reference for building your own scripts for - starting and stopping the cluster. + available. The command uses the configuration to determine the Namenodes + in the cluster and then starts the Namenode process on those nodes. The + Datanodes are started on the nodes specified in the <<>> file. The + script can be used as a reference for building your own scripts to + start and stop the cluster. ** Balancer - Balancer has been changed to work with multiple Namenodes in the cluster to - balance the cluster. Balancer can be run using the command: + The Balancer has been changed to work with multiple + Namenodes. The Balancer can be run using the command: ---- [hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon start balancer [-policy ] ---- - Policy could be: + The policy parameter can be any of the following: * <<>> - this is the policy. This balances the storage at - the datanode level. This is similar to balancing policy from prior releases. + the Datanode level. This is similar to the balancing policy from prior releases. - * <<>> - this balances the storage at the block pool level.
- Balancing at block pool level balances storage at the datanode level also. + * <<>> - this balances the storage at the block pool + level, which also balances at the Datanode level. Note that Balancer only balances the data and does not balance the namespace. For the complete command usage, see {{{../hadoop-common/CommandsManual.html#balancer}balancer}}. @@ -299,44 +298,42 @@ HDFS Federation ** Decommissioning Decommissioning is similar to prior releases. The nodes that need to be - decomissioned are added to the exclude file at all the Namenode. Each + decommissioned are added to the exclude file at all of the Namenodes. Each Namenode decommissions its Block Pool. When all the Namenodes finish - decommissioning a datanode, the datanode is considered to be decommissioned. + decommissioning a Datanode, the Datanode is considered decommissioned. - <>: To distributed an exclude file to all the Namenodes, use the + <>: To distribute an exclude file to all the Namenodes, use the following command: ---- -[hdfs]$ $HADOOP_PREFIX/sbin/distributed-exclude.sh +[hdfs]$ $HADOOP_PREFIX/sbin/distribute-exclude.sh ---- - <>: Refresh all the Namenodes to pick up the new exclude file. + <>: Refresh all the Namenodes to pick up the new exclude file: ---- [hdfs]$ $HADOOP_PREFIX/sbin/refresh-namenodes.sh ---- - The above command uses HDFS configuration to determine the Namenodes - configured in the cluster and refreshes all the Namenodes to pick up + The above command uses HDFS configuration to determine the + configured Namenodes in the cluster and refreshes them to pick up the new exclude file. ** Cluster Web Console - Similar to Namenode status web page, a Cluster Web Console is added in - federation to monitor the federated cluster at + Similar to the Namenode status web page, when using federation a + Cluster Web Console is available to monitor the federated cluster at <</dfsclusterhealth.jsp>>>. Any Namenode in the cluster can be used to access this web page.
- The web page provides the following information: + The Cluster Web Console provides the following information: - * Cluster summary that shows number of files, number of blocks and - total configured storage capacity, available and used storage information + * A cluster summary that shows the number of files, number of blocks, + total configured storage capacity, and the available and used storage for the entire cluster. - * Provides list of Namenodes and summary that includes number of files, - blocks, missing blocks, number of live and dead data nodes for each - Namenode. It also provides a link to conveniently access Namenode web UI. - - * It also provides decommissioning status of datanodes. - + * A list of Namenodes and a summary that includes the number of files, + blocks, missing blocks, and live and dead Datanodes for each + Namenode. It also provides a link to access each Namenode's web UI. + * The decommissioning status of Datanodes.
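The patched document describes adding <<dfs.nameservices>> plus per-Namenode parameters suffixed with the NameServiceID, but the diff context elides the file's full example. As a minimal sketch of what such an hdfs-site.xml fragment can look like for two Namenodes (the nameservice IDs ns1/ns2, host names, and ports below are hypothetical placeholders; the property names are the standard federation keys the document refers to):

```xml
<configuration>
  <!-- Comma-separated list of NameServiceIDs; Datanodes use this
       to discover all the Namenodes in the cluster. -->
  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2</value>
  </property>

  <!-- Per-Namenode parameters, each suffixed with its NameServiceID.
       Host names and ports here are illustrative only. -->
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>nn-host1:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns1</name>
    <value>nn-host1:50070</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address.ns1</name>
    <value>snn-host1:50090</value>
  </property>

  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>nn-host2:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns2</name>
    <value>nn-host2:50070</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address.ns2</name>
    <value>snn-host2:50090</value>
  </property>
</configuration>
```

Because every node receives this same file, a Datanode registers with both ns1 and ns2, while each Namenode reads only the parameters carrying its own suffix.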