diff --git a/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-3077.txt b/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-3077.txt index d1b9a9b58f8..3f0c3d72273 100644 --- a/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-3077.txt +++ b/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-3077.txt @@ -72,3 +72,5 @@ HDFS-3906. QJM: quorum timeout on failover with large log segment (todd) HDFS-3840. JournalNodes log JournalNotFormattedException backtrace error before being formatted (todd) HDFS-3894. QJM: testRecoverAfterDoubleFailures can be flaky due to IPC client caching (todd) + +HDFS-3926. QJM: Add user documentation for QJM. (atm) diff --git a/hadoop-project/src/site/site.xml b/hadoop-project/src/site/site.xml index f449b5d9b40..ae2ab37cad4 100644 --- a/hadoop-project/src/site/site.xml +++ b/hadoop-project/src/site/site.xml @@ -54,7 +54,8 @@ - + + diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailability.apt.vm b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailabilityWithNFS.apt.vm similarity index 95% rename from hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailability.apt.vm rename to hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailabilityWithNFS.apt.vm index 7e7cb66772f..efa3f931bbc 100644 --- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailability.apt.vm +++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailabilityWithNFS.apt.vm @@ -25,12 +25,21 @@ HDFS High Availability * {Purpose} This guide provides an overview of the HDFS High Availability (HA) feature and - how to configure and manage an HA HDFS cluster. + how to configure and manage an HA HDFS cluster, using NFS for the shared + storage required by the NameNodes. This document assumes that the reader has a general understanding of general components and node types in an HDFS cluster. Please refer to the HDFS Architecture guide for details. 
+* {Note: Using the Quorum Journal Manager or Conventional Shared Storage} + + This guide discusses how to configure and use HDFS HA using a shared NFS + directory to share edit logs between the Active and Standby NameNodes. For + information on how to configure HDFS HA using the Quorum Journal Manager + instead of NFS, please see {{{./HDFSHighAvailabilityWithQJM.html}this + alternative guide.}} + * {Background} Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in @@ -297,7 +306,7 @@ HDFS High Availability dfs.ha.fencing.ssh.connect-timeout - + 30000 --- @@ -375,17 +384,22 @@ HDFS High Availability ** Deployment details After all of the necessary configuration options have been set, one must - initially synchronize the two HA NameNodes' on-disk metadata. If you are - setting up a fresh HDFS cluster, you should first run the format command () on one of NameNodes. If you have already formatted the - NameNode, or are converting a non-HA-enabled cluster to be HA-enabled, you - should now copy over the contents of your NameNode metadata directories to - the other, unformatted NameNode using or a similar utility. The location - of the directories containing the NameNode metadata are configured via the - configuration options <> and/or - <>. At this time, you should also ensure that the - shared edits dir (as configured by <>) includes - all recent edits files which are in your NameNode metadata directories. + initially synchronize the two HA NameNodes' on-disk metadata. + + * If you are setting up a fresh HDFS cluster, you should first run the format + command () on one of the NameNodes. + + * If you have already formatted the NameNode, or are converting a + non-HA-enabled cluster to be HA-enabled, you should now copy over the + contents of your NameNode metadata directories to the other, unformatted + NameNode by running the command "" on the + unformatted NameNode.
Running this command will also ensure that the shared + edits directory (as configured by <>) contains + sufficient edits transactions to be able to start both NameNodes. + + * If you are converting a non-HA NameNode to be HA, you should run the + command "", which will initialize the shared + edits directory with the edits data from the local NameNode edits directories. At this point you may start both of your HA NameNodes as you normally would start a NameNode. @@ -863,4 +877,4 @@ $ zkCli.sh create /ledgers/available 0 3) Auto-Recovery of storage node failures. Work inprogress {{{https://issues.apache.org/jira/browse/BOOKKEEPER-237 }BOOKKEEPER-237}}. - Currently we have the tools to manually recover the data from failed storage nodes. \ No newline at end of file + Currently we have the tools to manually recover the data from failed storage nodes. diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailabilityWithQJM.apt.vm b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailabilityWithQJM.apt.vm new file mode 100644 index 00000000000..2aefc3584c0 --- /dev/null +++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/HDFSHighAvailabilityWithQJM.apt.vm @@ -0,0 +1,767 @@ +~~ Licensed under the Apache License, Version 2.0 (the "License"); +~~ you may not use this file except in compliance with the License. +~~ You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. See accompanying LICENSE file. 
+ + --- + Hadoop Distributed File System-${project.version} - High Availability + --- + --- + ${maven.build.timestamp} + +HDFS High Availability Using the Quorum Journal Manager + + \[ {{{./index.html}Go Back}} \] + +%{toc|section=1|fromDepth=0} + +* {Purpose} + + This guide provides an overview of the HDFS High Availability (HA) feature + and how to configure and manage an HA HDFS cluster, using the Quorum Journal + Manager (QJM) feature. + + This document assumes that the reader has a general understanding of + the components and node types in an HDFS cluster. Please refer to the + HDFS Architecture guide for details. + +* {Note: Using the Quorum Journal Manager or Conventional Shared Storage} + + This guide discusses how to configure and use HDFS HA using the Quorum + Journal Manager (QJM) to share edit logs between the Active and Standby + NameNodes. For information on how to configure HDFS HA using NFS for shared + storage instead of the QJM, please see + {{{./HDFSHighAvailabilityWithNFS.html}this alternative guide.}} + +* {Background} + + Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in + an HDFS cluster. Each cluster had a single NameNode, and if that machine or + process became unavailable, the cluster as a whole would be unavailable + until the NameNode was either restarted or brought up on a separate machine. + + This impacted the total availability of the HDFS cluster in two major ways: + + * In the case of an unplanned event such as a machine crash, the cluster would + be unavailable until an operator restarted the NameNode. + + * Planned maintenance events such as software or hardware upgrades on the + NameNode machine would result in windows of cluster downtime. + + The HDFS High Availability feature addresses the above problems by providing + the option of running two redundant NameNodes in the same cluster in an + Active/Passive configuration with a hot standby.
This allows a fast failover to + a new NameNode in the case that a machine crashes, or a graceful + administrator-initiated failover for the purpose of planned maintenance. + +* {Architecture} + + In a typical HA cluster, two separate machines are configured as NameNodes. + At any point in time, exactly one of the NameNodes is in an state, + and the other is in a state. The Active NameNode is responsible + for all client operations in the cluster, while the Standby is simply acting + as a slave, maintaining enough state to provide a fast failover if + necessary. + + In order for the Standby node to keep its state synchronized with the Active + node, both nodes communicate with a group of separate daemons called + "JournalNodes" (JNs). When any namespace modification is performed by the + Active node, it durably logs a record of the modification to a majority of + these JNs. The Standby node is capable of reading the edits from the JNs, and + is constantly watching them for changes to the edit log. As the Standby node + sees the edits, it applies them to its own namespace. In the event of a + failover, the Standby will ensure that it has read all of the edits from the + JournalNodes before promoting itself to the Active state. This ensures that the + namespace state is fully synchronized before a failover occurs. + + In order to provide a fast failover, it is also necessary that the Standby node + have up-to-date information regarding the location of blocks in the cluster. + In order to achieve this, the DataNodes are configured with the location of + both NameNodes, and send block location information and heartbeats to both. + + It is vital for the correct operation of an HA cluster that only one of the + NameNodes be Active at a time. Otherwise, the namespace state would quickly + diverge between the two, risking data loss or other incorrect results.
In + order to ensure this property and prevent the so-called "split-brain scenario," + the JournalNodes will only ever allow a single NameNode to be a writer at a + time. During a failover, the NameNode which is to become active will simply + take over the role of writing to the JournalNodes, which will effectively + prevent the other NameNode from continuing in the Active state, allowing the + new Active to safely proceed with failover. + +* {Hardware resources} + + In order to deploy an HA cluster, you should prepare the following: + + * <> - the machines on which you run the Active and + Standby NameNodes should have equivalent hardware to each other, and + equivalent hardware to what would be used in a non-HA cluster. + + * <> - the machines on which you run the JournalNodes. + The JournalNode daemon is relatively lightweight, so these daemons may + reasonably be collocated on machines with other Hadoop daemons, for example + NameNodes, the JobTracker, or the YARN ResourceManager. <> There + must be at least 3 JournalNode daemons, since edit log modifications must be + written to a majority of JNs. This will allow the system to tolerate the + failure of a single machine. You may also run more than 3 JournalNodes, but + in order to actually increase the number of failures the system can tolerate, + you should run an odd number of JNs (i.e. 3, 5, 7, etc.). Note that when + running with N JournalNodes, the system can tolerate at most (N - 1) / 2 + failures and continue to function normally. + + Note that, in an HA cluster, the Standby NameNode also performs checkpoints of + the namespace state, and thus it is not necessary to run a Secondary NameNode, + CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an + error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster + to be HA-enabled to reuse the hardware which they had previously dedicated to + the Secondary NameNode.
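The majority-quorum arithmetic described above can be sketched in a few lines. This is a purely illustrative example (the helper functions are not part of Hadoop); it shows why an even JournalNode count adds no fault tolerance over the odd count below it:

```python
def tolerated_failures(num_jns: int) -> int:
    """A write must reach a majority of JournalNodes, so with N JNs
    the system tolerates at most (N - 1) // 2 failed nodes."""
    return (num_jns - 1) // 2

def has_quorum(num_jns: int, alive: int) -> bool:
    """True while a strict majority of the JNs is still reachable."""
    return alive > num_jns // 2

# 4 JNs tolerate no more failures than 3; 6 no more than 5.
for n in (3, 4, 5, 6, 7):
    print(n, "JNs ->", tolerated_failures(n), "tolerated failure(s)")
```

With 3 JNs one failure is survivable; going to 4 still only tolerates one, which is why the guide recommends odd counts.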
+ +* {Deployment} + +** Configuration overview + + Similar to Federation configuration, HA configuration is backward compatible + and allows existing single NameNode configurations to work without change. + The new configuration is designed such that all the nodes in the cluster may + have the same configuration without the need for deploying different + configuration files to different machines based on the type of the node. + + Like HDFS Federation, HA clusters reuse the <<>> to identify a + single HDFS instance that may in fact consist of multiple HA NameNodes. In + addition, a new abstraction called <<>> is added with HA. Each + distinct NameNode in the cluster has a different NameNode ID to distinguish it. + To support a single configuration file for all of the NameNodes, the relevant + configuration parameters are suffixed with the <> as well as + the <>. + +** Configuration details + + To configure HA NameNodes, you must add several configuration options to your + <> configuration file. + + The order in which you set these configurations is unimportant, but the values + you choose for <> and + <> will determine the keys of those that + follow. Thus, you should decide on these values before setting the rest of the + configuration options. + + * <> - the logical name for this new nameservice + + Choose a logical name for this nameservice, for example "mycluster", and use + this logical name for the value of this config option. The name you choose is + arbitrary. It will be used both for configuration and as the authority + component of absolute HDFS paths in the cluster. + + <> If you are also using HDFS Federation, this configuration setting + should also include the list of other nameservices, HA or otherwise, as a + comma-separated list. + +---- + + dfs.nameservices + mycluster + +---- + + * <> - unique identifiers for each NameNode in the nameservice + + Configure with a list of comma-separated NameNode IDs. 
This will be used by + DataNodes to determine all the NameNodes in the cluster. For example, if you + used "mycluster" as the nameservice ID previously, and you wanted to use "nn1" + and "nn2" as the individual IDs of the NameNodes, you would configure this as + such: + +---- + + dfs.ha.namenodes.mycluster + nn1,nn2 + +---- + + <> Currently, only a maximum of two NameNodes may be configured per + nameservice. + + * <> - the fully-qualified RPC address for each NameNode to listen on + + For both of the previously-configured NameNode IDs, set the full address and + IPC port of the NameNode process. Note that this results in two separate + configuration options. For example: + +---- + + dfs.namenode.rpc-address.mycluster.nn1 + machine1.example.com:8020 + + + dfs.namenode.rpc-address.mycluster.nn2 + machine2.example.com:8020 + +---- + + <> You may similarly configure the "<>" setting if + you so desire. + + * <> - the fully-qualified HTTP address for each NameNode to listen on + + Similarly to above, set the addresses for both NameNodes' HTTP + servers to listen on. For example: + +---- + + dfs.namenode.http-address.mycluster.nn1 + machine1.example.com:50070 + + + dfs.namenode.http-address.mycluster.nn2 + machine2.example.com:50070 + +---- + + <> If you have Hadoop's security features enabled, you should also set + the similarly for each NameNode. + + * <> - the URI which identifies the group of JNs where the NameNodes will write/read edits + + This is where one configures the addresses of the JournalNodes which provide + the shared edits storage, written to by the Active NameNode and read by the + Standby NameNode to stay up-to-date with all the file system changes the Active + NameNode makes. Though you must specify several JournalNode addresses, + <> The URI should be of the form: + "qjournal://;;/". The Journal + ID is a unique identifier for this nameservice, which allows a single set of + JournalNodes to provide storage for multiple federated namesystems.
Though not + a requirement, it's a good idea to reuse the nameservice ID for the journal + identifier. + + For example, if the JournalNodes for this cluster were running on the + machines "node1.example.com", "node2.example.com", and "node3.example.com" and + the nameservice ID were "mycluster", you would use the following as the value + for this setting (the default port for the JournalNode is 8485): + +---- + + dfs.namenode.shared.edits.dir + qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster + +---- + + * <> - the Java class that HDFS clients use to contact the Active NameNode + + Configure the name of the Java class which will be used by the DFS Client to + determine which NameNode is the current Active, and therefore which NameNode is + currently serving client requests. The only implementation which currently + ships with Hadoop is the <>, so use this + unless you are using a custom one. For example: + +---- + + dfs.client.failover.proxy.provider.mycluster + org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider + +---- + + * <> - a list of scripts or Java classes which will be used to fence the Active NameNode during a failover + + It is desirable for correctness of the system that only one NameNode be in + the Active state at any given time. <> However, when a failover occurs, it is still + possible that the previous Active NameNode could serve read requests to + clients, which may be out of date until that NameNode shuts down when trying to + write to the JournalNodes. For this reason, it is still desirable to configure + some fencing methods even when using the Quorum Journal Manager. However, to + improve the availability of the system in the event the fencing mechanisms + fail, it is advisable to configure a fencing method which is guaranteed to + return success as the last fencing method in the list. 
Note that if you choose + to use no actual fencing methods, you still must configure something for this + setting, for example "<<>>". + + The fencing methods used during a failover are configured as a + carriage-return-separated list, which will be attempted in order until one + indicates that fencing has succeeded. There are two methods which ship with + Hadoop: and . For information on implementing your own custom + fencing method, see the class. + + * <> - SSH to the Active NameNode and kill the process + + The option SSHes to the target node and uses to kill the + process listening on the service's TCP port. In order for this fencing option + to work, it must be able to SSH to the target node without providing a + passphrase. Thus, one must also configure the + <> option, which is a + comma-separated list of SSH private key files. For example: + +--- + + dfs.ha.fencing.methods + sshfence + + + + dfs.ha.fencing.ssh.private-key-files + /home/exampleuser/.ssh/id_rsa + +--- + + Optionally, one may configure a non-standard username or port to perform the + SSH. One may also configure a timeout, in milliseconds, for the SSH, after + which this fencing method will be considered to have failed. It may be + configured like so: + +--- + + dfs.ha.fencing.methods + sshfence([[username][:port]]) + + + dfs.ha.fencing.ssh.connect-timeout + 30000 + +--- + + * <> - run an arbitrary shell command to fence the Active NameNode + + The fencing method runs an arbitrary shell command. It may be + configured like so: + +--- + + dfs.ha.fencing.methods + shell(/path/to/my/script.sh arg1 arg2 ...) + +--- + + The string between '(' and ')' is passed directly to a bash shell and may not + include any closing parentheses. + + The shell command will be run with an environment set up to contain all of the + current Hadoop configuration variables, with the '_' character replacing any + '.' characters in the configuration keys. 
The configuration used has already had + any namenode-specific configurations promoted to their generic forms -- for example + <> will contain the RPC address of the target node, even + though the configuration may specify that variable as + <>. + + Additionally, the following variables referring to the target node to be fenced + are also available: + +*-----------------------:-----------------------------------+ +| $target_host | hostname of the node to be fenced | +*-----------------------:-----------------------------------+ +| $target_port | IPC port of the node to be fenced | +*-----------------------:-----------------------------------+ +| $target_address | the above two, combined as host:port | +*-----------------------:-----------------------------------+ +| $target_nameserviceid | the nameservice ID of the NN to be fenced | +*-----------------------:-----------------------------------+ +| $target_namenodeid | the namenode ID of the NN to be fenced | +*-----------------------:-----------------------------------+ + + These environment variables may also be used as substitutions in the shell + command itself. For example: + +--- + + dfs.ha.fencing.methods + shell(/path/to/my/script.sh --nameservice=$target_nameserviceid $target_host:$target_port) + +--- + + If the shell command returns an exit + code of 0, the fencing is determined to be successful. If it returns any other + exit code, the fencing was not successful and the next fencing method in the + list will be attempted. + + <> This fencing method does not implement any timeout. If timeouts are + necessary, they should be implemented in the shell script itself (eg by forking + a subshell to kill its parent in some number of seconds). + + * <> - the default path prefix used by the Hadoop FS client when none is given + + Optionally, you may now configure the default path for Hadoop clients to use + the new HA-enabled logical URI. 
If you used "mycluster" as the nameservice ID + earlier, this will be the value of the authority portion of all of your HDFS + paths. This may be configured like so, in your <> file: + +--- + + fs.defaultFS + hdfs://mycluster + +--- + + + * <> - the path where the JournalNode daemon will store its local state + + This is the absolute path on the JournalNode machines where the edits and + other local state used by the JNs will be stored. You may only use a single + path for this configuration. Redundancy for this data is provided by running + multiple separate JournalNodes, or by configuring this directory on a + locally-attached RAID array. For example: + +--- + + dfs.journalnode.edits.dir + /path/to/journal/node/local/data + +--- + +** Deployment details + + After all of the necessary configuration options have been set, you must + start the JournalNode daemons on the set of machines where they will run. This + can be done by running the command "" and waiting + for the daemon to start on each of the relevant machines. + + Once the JournalNodes have been started, one must initially synchronize the + two HA NameNodes' on-disk metadata. + + * If you are setting up a fresh HDFS cluster, you should first run the format + command () on one of the NameNodes. + + * If you have already formatted the NameNode, or are converting a + non-HA-enabled cluster to be HA-enabled, you should now copy over the + contents of your NameNode metadata directories to the other, unformatted + NameNode by running the command "" on the + unformatted NameNode. Running this command will also ensure that the + JournalNodes (as configured by <>) contain + sufficient edits transactions to be able to start both NameNodes. + + * If you are converting a non-HA NameNode to be HA, you should run the + command "", which will initialize the + JournalNodes with the edits data from the local NameNode edits directories.
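The three bootstrap scenarios above amount to a small decision table. The sketch below is purely illustrative — the scenario names and the helper are invented here, and the actual steps are performed with the HDFS command-line tools:

```python
def metadata_sync_step(scenario: str) -> str:
    """Map each starting scenario to the one-time metadata
    synchronization step described in the guide (hypothetical helper)."""
    steps = {
        # Brand-new cluster: format exactly one of the NameNodes first.
        "fresh cluster": "run the format command on one of the NameNodes",
        # One NameNode already formatted: copy its metadata to the other,
        # which also seeds the JournalNodes with sufficient edit transactions.
        "existing formatted NameNode": "bootstrap the unformatted NameNode",
        # Converting a non-HA NameNode: seed the JournalNodes from the
        # local NameNode edits directories.
        "converting non-HA to HA": "initialize the shared edits directory",
    }
    return steps[scenario]

for s in ("fresh cluster", "existing formatted NameNode",
          "converting non-HA to HA"):
    print(s, "->", metadata_sync_step(s))
```

Whichever case applies, the synchronization is a one-time step; afterwards both NameNodes are started normally.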
+ + At this point you may start both of your HA NameNodes as you normally would + start a NameNode. + + You can visit each of the NameNodes' web pages separately by browsing to their + configured HTTP addresses. You should notice that next to the configured + address will be the HA state of the NameNode (either "standby" or "active"). + Whenever an HA NameNode starts, it is initially in the Standby state. + +** Administrative commands + + Now that your HA NameNodes are configured and started, you will have access + to some additional commands to administer your HA HDFS cluster. Specifically, + you should familiarize yourself with all of the subcommands of the "" command. Running this command without any additional arguments will + display the following usage information: + +--- +Usage: DFSHAAdmin [-ns ] + [-transitionToActive ] + [-transitionToStandby ] + [-failover [--forcefence] [--forceactive] ] + [-getServiceState ] + [-checkHealth ] + [-help ] +--- + + This guide describes high-level uses of each of these subcommands. For + specific usage information of each subcommand, you should run ">". + + * <> and <> - transition the state of the given NameNode to Active or Standby + + These subcommands cause a given NameNode to transition to the Active or Standby + state, respectively. <> Instead, one should almost always prefer to + use the "" subcommand. + + * <> - initiate a failover between two NameNodes + + This subcommand causes a failover from the first provided NameNode to the + second. If the first NameNode is in the Standby state, this command simply + transitions the second to the Active state without error. If the first NameNode + is in the Active state, an attempt will be made to gracefully transition it to + the Standby state. If this fails, the fencing methods (as configured by + <>) will be attempted in order until one + succeeds. Only after this process will the second NameNode be transitioned to + the Active state.
If no fencing method succeeds, the second NameNode will not + be transitioned to the Active state, and an error will be returned. + + * <> - determine whether the given NameNode is Active or Standby + + Connect to the provided NameNode to determine its current state, printing + either "standby" or "active" to STDOUT appropriately. This subcommand might be + used by cron jobs or monitoring scripts which need to behave differently based + on whether the NameNode is currently Active or Standby. + + * <> - check the health of the given NameNode + + Connect to the provided NameNode to check its health. The NameNode is capable + of performing some diagnostics on itself, including checking if internal + services are running as expected. This command will return 0 if the NameNode is + healthy, non-zero otherwise. One might use this command for monitoring + purposes. + + <> This is not yet implemented, and at present will always return + success, unless the given NameNode is completely down. + +* {Automatic Failover} + +** Introduction + + The above sections describe how to configure manual failover. In that mode, + the system will not automatically trigger a failover from the active to the + standby NameNode, even if the active node has failed. This section describes + how to configure and deploy automatic failover. + +** Components + + Automatic failover adds two new components to an HDFS deployment: a ZooKeeper + quorum, and the ZKFailoverController process (abbreviated as ZKFC). + + Apache ZooKeeper is a highly available service for maintaining small amounts + of coordination data, notifying clients of changes in that data, and + monitoring clients for failures. The implementation of automatic HDFS failover + relies on ZooKeeper for the following things: + + * <> - each of the NameNode machines in the cluster + maintains a persistent session in ZooKeeper. 
If the machine crashes, the + ZooKeeper session will expire, notifying the other NameNode that a failover + should be triggered. + + * <> - ZooKeeper provides a simple mechanism to + exclusively elect a node as active. If the current active NameNode crashes, + another node may take a special exclusive lock in ZooKeeper indicating that + it should become the next active. + + The ZKFailoverController (ZKFC) is a new component which is a ZooKeeper client + which also monitors and manages the state of the NameNode. Each of the + machines which runs a NameNode also runs a ZKFC, and that ZKFC is responsible + for: + + * <> - the ZKFC pings its local NameNode on a periodic + basis with a health-check command. So long as the NameNode responds in a + timely fashion with a healthy status, the ZKFC considers the node + healthy. If the node has crashed, frozen, or otherwise entered an unhealthy + state, the health monitor will mark it as unhealthy. + + * <> - when the local NameNode is healthy, the + ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it + also holds a special "lock" znode. This lock uses ZooKeeper's support for + "ephemeral" nodes; if the session expires, the lock node will be + automatically deleted. + + * <> - if the local NameNode is healthy, and the + ZKFC sees that no other node currently holds the lock znode, it will itself + try to acquire the lock. If it succeeds, then it has "won the election", and + is responsible for running a failover to make its local NameNode active. The + failover process is similar to the manual failover described above: first, + the previous active is fenced if necessary, and then the local NameNode + transitions to active state. + + For more details on the design of automatic failover, refer to the design + document attached to HDFS-2185 on the Apache HDFS JIRA. + +** Deploying ZooKeeper + + In a typical deployment, ZooKeeper daemons are configured to run on three or + five nodes. 
Since ZooKeeper itself has light resource requirements, it is + acceptable to collocate the ZooKeeper nodes on the same hardware as the HDFS + NameNode and Standby Node. Many operators choose to deploy the third ZooKeeper + process on the same node as the YARN ResourceManager. It is advisable to + configure the ZooKeeper nodes to store their data on separate disk drives from + the HDFS metadata for best performance and isolation. + + The setup of ZooKeeper is out of scope for this document. We will assume that + you have set up a ZooKeeper cluster running on three or more nodes, and have + verified its correct operation by connecting using the ZK CLI. + +** Before you begin + + Before you begin configuring automatic failover, you should shut down your + cluster. It is not currently possible to transition from a manual failover + setup to an automatic failover setup while the cluster is running. + +** Configuring automatic failover + + The configuration of automatic failover requires the addition of two new + parameters to your configuration. In your <<>> file, add: + +---- + + dfs.ha.automatic-failover.enabled + true + +---- + + This specifies that the cluster should be set up for automatic failover. + In your <<>> file, add: + +---- + + ha.zookeeper.quorum + zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181 + +---- + + This lists the host-port pairs running the ZooKeeper service. + + As with the parameters described earlier in the document, these settings may + be configured on a per-nameservice basis by suffixing the configuration key + with the nameservice ID. For example, in a cluster with federation enabled, + you can explicitly enable automatic failover for only one of the nameservices + by setting <<>>. + + There are also several other configuration parameters which may be set to + control the behavior of automatic failover; however, they are not necessary + for most installations. 
Please refer to the configuration key specific + documentation for details. + +** Initializing HA state in ZooKeeper + + After the configuration keys have been added, the next step is to initialize + required state in ZooKeeper. You can do so by running the following command + from one of the NameNode hosts. + +---- +$ hdfs zkfc -formatZK +---- + + This will create a znode in ZooKeeper inside of which the automatic failover + system stores its data. + +** Starting the cluster with <<>> + + Since automatic failover has been enabled in the configuration, the + <<>> script will now automatically start a ZKFC daemon on any + machine that runs a NameNode. When the ZKFCs start, they will automatically + select one of the NameNodes to become active. + +** Starting the cluster manually + + If you manually manage the services on your cluster, you will need to manually + start the <<>> daemon on each of the machines that runs a NameNode. You + can start the daemon by running: + +---- +$ hadoop-daemon.sh start zkfc +---- + +** Securing access to ZooKeeper + + If you are running a secure cluster, you will likely want to ensure that the + information stored in ZooKeeper is also secured. This prevents malicious + clients from modifying the metadata in ZooKeeper or potentially triggering a + false failover. + + In order to secure the information in ZooKeeper, first add the following to + your <<>> file: + +---- + + ha.zookeeper.auth + @/path/to/zk-auth.txt + + + ha.zookeeper.acl + @/path/to/zk-acl.txt + +---- + + Please note the '@' character in these values -- this specifies that the + configurations are not inline, but rather point to a file on disk. + + The first configured file specifies a list of ZooKeeper authentications, in + the same format as used by the ZK CLI. For example, you may specify something + like: + +---- +digest:hdfs-zkfcs:mypassword +---- + ...where <<>> is a unique username for ZooKeeper, and + <<>> is some unique string used as a password. 
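For reference, the digest that corresponds to such a credential can also be computed directly. The sketch below assumes ZooKeeper's documented "digest" scheme, base64(SHA-1("user:password")); the username and password are the guide's example values, and the helper is illustrative rather than an official tool — verify its output against the java command shown below before relying on it:

```python
import base64
import hashlib

def zk_digest_acl(user: str, password: str, perms: str = "rwcda") -> str:
    """Build a 'digest'-scheme ACL entry for zk-acl.txt.

    Mirrors what ZooKeeper's DigestAuthenticationProvider computes:
    the digest portion is base64(SHA-1("user:password")).
    """
    raw = f"{user}:{password}".encode("utf-8")
    digest = base64.b64encode(hashlib.sha1(raw).digest()).decode("ascii")
    return f"digest:{user}:{digest}:{perms}"

print(zk_digest_acl("hdfs-zkfcs", "mypassword"))
```

The resulting line can be placed in the ACL file configured above; the digest portion should match the output of the DigestAuthenticationProvider command for the same credentials.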
+ + Next, generate a ZooKeeper ACL that corresponds to this authentication, using + a command like the following: + +---- +$ java -cp $ZK_HOME/lib/*:$ZK_HOME/zookeeper-3.4.2.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider hdfs-zkfcs:mypassword +output: hdfs-zkfcs:mypassword->hdfs-zkfcs:P/OQvnYyU/nF/mGYvB/xurX8dYs= +---- + + Copy and paste the section of this output after the '->' string into the file + <<>>, prefixed by the string "<<>>". For example: + +---- +digest:hdfs-zkfcs:P/OQvnYyU/nF/mGYvB/xurX8dYs=:rwcda +---- + + In order for these ACLs to take effect, you should then rerun the + <<>> command as described above. + + After doing so, you may verify the ACLs from the ZK CLI as follows: + +---- +[zk: localhost:2181(CONNECTED) 1] getAcl /hadoop-ha +'digest,'hdfs-zkfcs:P/OQvnYyU/nF/mGYvB/xurX8dYs= +: cdrwa +---- + +** Verifying automatic failover + + Once automatic failover has been set up, you should test its operation. To do + so, first locate the active NameNode. You can tell which node is active by + visiting the NameNode web interfaces -- each node reports its HA state at the + top of the page. + + Once you have located your active NameNode, you may cause a failure on that + node. For example, you can use <<>> to simulate a JVM + crash. Or, you could power cycle the machine or unplug its network interface + to simulate a different kind of outage. After triggering the outage you wish + to test, the other NameNode should automatically become active within several + seconds. The amount of time required to detect a failure and trigger a + failover depends on the configuration of + <<>>, but defaults to 5 seconds. + + If the test does not succeed, you may have a misconfiguration. Check the logs + for the <<>> daemons as well as the NameNode daemons in order to further + diagnose the issue. + + +* Automatic Failover FAQ + + * <> + + No. On any given node you may start the ZKFC before or after its corresponding + NameNode.
+ + * <> + + You should add monitoring on each host that runs a NameNode to ensure that the + ZKFC remains running. In some types of ZooKeeper failures, for example, the + ZKFC may unexpectedly exit, and should be restarted to ensure that the system + is ready for automatic failover. + + Additionally, you should monitor each of the servers in the ZooKeeper + quorum. If ZooKeeper crashes, then automatic failover will not function. + + * <> + + If the ZooKeeper cluster crashes, no automatic failovers will be triggered. + However, HDFS will continue to run without any impact. When ZooKeeper is + restarted, HDFS will reconnect with no issues. + + * <> + + No. Currently, this is not supported. Whichever NameNode is started first will + become active. You may choose to start the cluster in a specific order such + that your preferred node starts first. + + * <> + + Even if automatic failover is configured, you may initiate a manual failover + using the same <<>> command. It will perform a coordinated + failover.