hadoop/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS...

1074 lines
65 KiB
HTML
Raw Normal View History

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--
| Generated by Apache Maven Doxia at 2023-03-01
| Rendered using Apache Maven Stylus Skin 1.5
-->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Apache Hadoop 3.4.0-SNAPSHOT &#x2013; HDFS High Availability</title>
<style type="text/css" media="all">
@import url("./css/maven-base.css");
@import url("./css/maven-theme.css");
@import url("./css/site.css");
</style>
<link rel="stylesheet" href="./css/print.css" type="text/css" media="print" />
<meta name="Date-Revision-yyyymmdd" content="20230301" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body class="composite">
<div id="banner">
<a href="http://hadoop.apache.org/" id="bannerLeft">
<img src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="" />
</a>
<a href="http://www.apache.org/" id="bannerRight">
<img src="http://www.apache.org/images/asf_logo_wide.png" alt="" />
</a>
<div class="clear">
<hr/>
</div>
</div>
<div id="breadcrumbs">
<div class="xright"> <a href="http://wiki.apache.org/hadoop" class="externalLink">Wiki</a>
|
<a href="https://gitbox.apache.org/repos/asf/hadoop.git" class="externalLink">git</a>
|
<a href="http://hadoop.apache.org/" class="externalLink">Apache Hadoop</a>
&nbsp;| Last Published: 2023-03-01
&nbsp;| Version: 3.4.0-SNAPSHOT
</div>
<div class="clear">
<hr/>
</div>
</div>
<div id="leftColumn">
<div id="navcolumn">
<h5>General</h5>
<ul>
<li class="none">
<a href="../../index.html">Overview</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/SingleCluster.html">Single Node Setup</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/ClusterSetup.html">Cluster Setup</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/CommandsManual.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/FileSystemShell.html">FileSystem Shell</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/Compatibility.html">Compatibility Specification</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/DownstreamDev.html">Downstream Developer's Guide</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/AdminCompatibilityGuide.html">Admin Compatibility Guide</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/InterfaceClassification.html">Interface Classification</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/filesystem/index.html">FileSystem Specification</a>
</li>
</ul>
<h5>Common</h5>
<ul>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/CLIMiniCluster.html">CLI Mini Cluster</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/FairCallQueue.html">Fair Call Queue</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/NativeLibraries.html">Native Libraries</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/Superusers.html">Proxy User</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/RackAwareness.html">Rack Awareness</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/SecureMode.html">Secure Mode</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/ServiceLevelAuth.html">Service Level Authorization</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/HttpAuthentication.html">HTTP Authentication</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/CredentialProviderAPI.html">Credential Provider API</a>
</li>
<li class="none">
<a href="../../hadoop-kms/index.html">Hadoop KMS</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/Tracing.html">Tracing</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/UnixShellGuide.html">Unix Shell Guide</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/registry/index.html">Registry</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/AsyncProfilerServlet.html">Async Profiler</a>
</li>
</ul>
<h5>HDFS</h5>
<ul>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsDesign.html">Architecture</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html">User Guide</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HDFSCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html">NameNode HA With QJM</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html">NameNode HA With NFS</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ObserverNameNode.html">Observer NameNode</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/Federation.html">Federation</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ViewFs.html">ViewFs</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ViewFsOverloadScheme.html">ViewFsOverloadScheme</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html">Snapshots</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsEditsViewer.html">Edits Viewer</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html">Image Viewer</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html">Permissions and HDFS</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html">Quotas and HDFS</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/LibHdfs.html">libhdfs (C API)</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/WebHDFS.html">WebHDFS (REST API)</a>
</li>
<li class="none">
<a href="../../hadoop-hdfs-httpfs/index.html">HttpFS</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html">Short Circuit Local Reads</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html">Centralized Cache Management</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html">NFS Gateway</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html">Rolling Upgrade</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ExtendedAttributes.html">Extended Attributes</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html">Transparent Encryption</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html">Multihoming</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html">Storage Policies</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/MemoryStorage.html">Memory Storage Support</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/SLGUserGuide.html">Synthetic Load Generator</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html">Erasure Coding</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html">Disk Balancer</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsUpgradeDomain.html">Upgrade Domain</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsDataNodeAdminGuide.html">DataNode Admin</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs-rbf/HDFSRouterFederation.html">Router Federation</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsProvidedStorage.html">Provided Storage</a>
</li>
</ul>
<h5>MapReduce</h5>
<ul>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html">Tutorial</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html">Compatibility with 1.x</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html">Encrypted Shuffle</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html">Pluggable Shuffle/Sort</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html">Distributed Cache Deploy</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/SharedCacheSupport.html">Support for YARN Shared Cache</a>
</li>
</ul>
<h5>MapReduce REST APIs</h5>
<ul>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredAppMasterRest.html">MR Application Master</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html">MR History Server</a>
</li>
</ul>
<h5>YARN</h5>
<ul>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/YARN.html">Architecture</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/YarnCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html">Capacity Scheduler</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/FairScheduler.html">Fair Scheduler</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html">ResourceManager Restart</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html">ResourceManager HA</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/ResourceModel.html">Resource Model</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/NodeLabel.html">Node Labels</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/NodeAttributes.html">Node Attributes</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html">Web Application Proxy</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/TimelineServer.html">Timeline Server</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html">Timeline Service V.2</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html">Writing YARN Applications</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/YarnApplicationSecurity.html">YARN Application Security</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/NodeManager.html">NodeManager</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/DockerContainers.html">Running Applications in Docker Containers</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/RuncContainers.html">Running Applications in runC Containers</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html">Using CGroups</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/SecureContainer.html">Secure Containers</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/ReservationSystem.html">Reservation System</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/GracefulDecommission.html">Graceful Decommission</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/OpportunisticContainers.html">Opportunistic Containers</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/Federation.html">YARN Federation</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/SharedCache.html">Shared Cache</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/UsingGpus.html">Using GPU</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/UsingFPGA.html">Using FPGA</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/PlacementConstraints.html">Placement Constraints</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/YarnUI2.html">YARN UI2</a>
</li>
</ul>
<h5>YARN REST APIs</h5>
<ul>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html">Introduction</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html">Resource Manager</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/NodeManagerRest.html">Node Manager</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/TimelineServer.html#Timeline_Server_REST_API_v1">Timeline Server</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html#Timeline_Service_v.2_REST_API">Timeline Service V.2</a>
</li>
</ul>
<h5>YARN Service</h5>
<ul>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/Overview.html">Overview</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/QuickStart.html">QuickStart</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/Concepts.html">Concepts</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/YarnServiceAPI.html">Yarn Service API</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/ServiceDiscovery.html">Service Discovery</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/SystemServices.html">System Services</a>
</li>
</ul>
<h5>Hadoop Compatible File Systems</h5>
<ul>
<li class="none">
<a href="../../hadoop-aliyun/tools/hadoop-aliyun/index.html">Aliyun OSS</a>
</li>
<li class="none">
<a href="../../hadoop-aws/tools/hadoop-aws/index.html">Amazon S3</a>
</li>
<li class="none">
<a href="../../hadoop-azure/index.html">Azure Blob Storage</a>
</li>
<li class="none">
<a href="../../hadoop-azure-datalake/index.html">Azure Data Lake Storage</a>
</li>
<li class="none">
<a href="../../hadoop-cos/cloud-storage/index.html">Tencent COS</a>
</li>
<li class="none">
<a href="../../hadoop-huaweicloud/cloud-storage/index.html">Huaweicloud OBS</a>
</li>
</ul>
<h5>Auth</h5>
<ul>
<li class="none">
<a href="../../hadoop-auth/index.html">Overview</a>
</li>
<li class="none">
<a href="../../hadoop-auth/Examples.html">Examples</a>
</li>
<li class="none">
<a href="../../hadoop-auth/Configuration.html">Configuration</a>
</li>
<li class="none">
<a href="../../hadoop-auth/BuildingIt.html">Building</a>
</li>
</ul>
<h5>Tools</h5>
<ul>
<li class="none">
<a href="../../hadoop-streaming/HadoopStreaming.html">Hadoop Streaming</a>
</li>
<li class="none">
<a href="../../hadoop-archives/HadoopArchives.html">Hadoop Archives</a>
</li>
<li class="none">
<a href="../../hadoop-archive-logs/HadoopArchiveLogs.html">Hadoop Archive Logs</a>
</li>
<li class="none">
<a href="../../hadoop-distcp/DistCp.html">DistCp</a>
</li>
<li class="none">
<a href="../../hadoop-federation-balance/HDFSFederationBalance.html">HDFS Federation Balance</a>
</li>
<li class="none">
<a href="../../hadoop-gridmix/GridMix.html">GridMix</a>
</li>
<li class="none">
<a href="../../hadoop-rumen/Rumen.html">Rumen</a>
</li>
<li class="none">
<a href="../../hadoop-resourceestimator/ResourceEstimator.html">Resource Estimator Service</a>
</li>
<li class="none">
<a href="../../hadoop-sls/SchedulerLoadSimulator.html">Scheduler Load Simulator</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/Benchmarking.html">Hadoop Benchmarking</a>
</li>
<li class="none">
<a href="../../hadoop-dynamometer/Dynamometer.html">Dynamometer</a>
</li>
</ul>
<h5>Reference</h5>
<ul>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/release/">Changelog and Release Notes</a>
</li>
<li class="none">
<a href="../../api/index.html">Java API docs</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/UnixShellAPI.html">Unix Shell API</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/Metrics.html">Metrics</a>
</li>
</ul>
<h5>Configuration</h5>
<ul>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/core-default.xml">core-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/hdfs-default.xml">hdfs-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs-rbf/hdfs-rbf-default.xml">hdfs-rbf-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml">mapred-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-common/yarn-default.xml">yarn-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-kms/kms-default.html">kms-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-hdfs-httpfs/httpfs-default.html">httpfs-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/DeprecatedProperties.html">Deprecated Properties</a>
</li>
</ul>
<a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy">
<img alt="Built by Maven" src="./images/logos/maven-feather.png"/>
</a>
</div>
</div>
<div id="bodyColumn">
<div id="contentBox">
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<h1>HDFS High Availability</h1>
<ul>
<li><a href="#Purpose">Purpose</a></li>
<li><a href="#Note:_Using_the_Quorum_Journal_Manager_or_Conventional_Shared_Storage">Note: Using the Quorum Journal Manager or Conventional Shared Storage</a></li>
<li><a href="#Background">Background</a></li>
<li><a href="#Architecture">Architecture</a></li>
<li><a href="#Hardware_resources">Hardware resources</a></li>
<li><a href="#Deployment">Deployment</a>
<ul>
<li><a href="#Configuration_overview">Configuration overview</a></li>
<li><a href="#Configuration_details">Configuration details</a></li>
<li><a href="#Deployment_details">Deployment details</a></li>
<li><a href="#Administrative_commands">Administrative commands</a></li></ul></li>
<li><a href="#Automatic_Failover">Automatic Failover</a>
<ul>
<li><a href="#Introduction">Introduction</a></li>
<li><a href="#Components">Components</a></li>
<li><a href="#Deploying_ZooKeeper">Deploying ZooKeeper</a></li>
<li><a href="#Before_you_begin">Before you begin</a></li>
<li><a href="#Configuring_automatic_failover">Configuring automatic failover</a></li>
<li><a href="#Initializing_HA_state_in_ZooKeeper">Initializing HA state in ZooKeeper</a></li>
<li><a href="#Starting_the_cluster_with_start-dfs.sh">Starting the cluster with start-dfs.sh</a></li>
<li><a href="#Starting_the_cluster_manually">Starting the cluster manually</a></li>
<li><a href="#Securing_access_to_ZooKeeper">Securing access to ZooKeeper</a></li>
<li><a href="#Verifying_automatic_failover">Verifying automatic failover</a></li></ul></li>
<li><a href="#Automatic_Failover_FAQ">Automatic Failover FAQ</a></li></ul>
<section>
<h2><a name="Purpose"></a>Purpose</h2>
<p>This guide provides an overview of the HDFS High Availability (HA) feature and how to configure and manage an HA HDFS cluster, using NFS for the shared storage required by the NameNodes.</p>
<p>This document assumes that the reader has a general understanding of general components and node types in an HDFS cluster. Please refer to the HDFS Architecture guide for details.</p></section><section>
<h2><a name="Note:_Using_the_Quorum_Journal_Manager_or_Conventional_Shared_Storage"></a>Note: Using the Quorum Journal Manager or Conventional Shared Storage</h2>
<p>This guide discusses how to configure and use HDFS HA using a shared NFS directory to share edit logs between the Active and Standby NameNodes. For information on how to configure HDFS HA using the Quorum Journal Manager instead of NFS, please see <a href="./HDFSHighAvailabilityWithQJM.html">this alternative guide.</a></p></section><section>
<h2><a name="Background"></a>Background</h2>
<p>Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.</p>
<p>This impacted the total availability of the HDFS cluster in two major ways:</p>
<ul>
<li>
<p>In the case of an unplanned event such as a machine crash, the cluster would be unavailable until an operator restarted the NameNode.</p>
</li>
<li>
<p>Planned maintenance events such as software or hardware upgrades on the NameNode machine would result in windows of cluster downtime.</p>
</li>
</ul>
<p>The HDFS High Availability feature addresses the above problems by providing the option of running two (or more, as of Hadoop 3.0.0) redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby(s). This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.</p></section><section>
<h2><a name="Architecture"></a>Architecture</h2>
<p>In a typical HA cluster, two or more separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an <i>Active</i> state, and the others are in a <i>Standby</i> state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.</p>
<p>In order for the Standby nodes to keep their state synchronized with the Active node, the current implementation requires that the nodes have access to a directory on a shared storage device (eg an NFS mount from a NAS). This restriction will likely be relaxed in future versions.</p>
<p>When any namespace modification is performed by the Active node, it durably logs a record of the modification to an edit log file stored in the shared directory. The Standby nodes are constantly watching this directory for edits, and as it sees the edits, it applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the shared storage before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.</p>
<p>In order to provide a fast failover, it is also necessary that the Standby nodes have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of all NameNodes, and send block location information and heartbeats to all the NameNodes.</p>
<p>It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called &#x201c;split-brain scenario,&#x201d; the administrator must configure at least one <i>fencing method</i> for the shared storage. During a failover, if it cannot be verified that the previous Active node has relinquished its Active state, the fencing process is responsible for cutting off the previous Active&#x2019;s access to the shared edits storage. This prevents it from making any further edits to the namespace, allowing the new Active to safely proceed with failover.</p></section><section>
<h2><a name="Hardware_resources"></a>Hardware resources</h2>
<p>In order to deploy an HA cluster, you should prepare the following:</p>
<ul>
<li>
<p><b>NameNode machines</b> - the machines on which you run the Active and Standby NameNodes should have equivalent hardware to each other, and equivalent hardware to what would be used in a non-HA cluster.</p>
</li>
<li>
<p><b>Shared storage</b> - you will need to have a shared directory which the NameNode machines have read/write access to. Typically this is a remote filer which supports NFS and is mounted on each of the NameNode machines. Currently only a single shared edits directory is supported. Thus, the availability of the system is limited by the availability of this shared edits directory, and therefore in order to remove all single points of failure there needs to be redundancy for the shared edits directory. Specifically, multiple network paths to the storage, and redundancy in the storage itself (disk, network, and power). Because of this, it is recommended that the shared storage server be a high-quality dedicated NAS appliance rather than a simple Linux server.</p>
</li>
</ul>
<p>Note that, in an HA cluster, the Standby NameNodes also perform checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to be HA-enabled to reuse the hardware which they had previously dedicated to the Secondary NameNode.</p></section><section>
<h2><a name="Deployment"></a>Deployment</h2><section>
<h3><a name="Configuration_overview"></a>Configuration overview</h3>
<p>Similar to Federation configuration, HA configuration is backward compatible and allows existing single NameNode configurations to work without change. The new configuration is designed such that all the nodes in the cluster may have the same configuration without the need for deploying different configuration files to different machines based on the type of the node.</p>
<p>Like HDFS Federation, HA clusters reuse the <code>nameservice ID</code> to identify a single HDFS instance that may in fact consist of multiple HA NameNodes. In addition, a new abstraction called <code>NameNode ID</code> is added with HA. Each distinct NameNode in the cluster has a different NameNode ID to distinguish it. To support a single configuration file for all of the NameNodes, the relevant configuration parameters are suffixed with the <b>nameservice ID</b> as well as the <b>NameNode ID</b>.</p></section><section>
<h3><a name="Configuration_details"></a>Configuration details</h3>
<p>To configure HA NameNodes, you must add several configuration options to your <b>hdfs-site.xml</b> configuration file.</p>
<p>The order in which you set these configurations is unimportant, but the values you choose for <b>dfs.nameservices</b> and <b>dfs.ha.namenodes.[nameservice ID]</b> will determine the keys of those that follow. Thus, you should decide on these values before setting the rest of the configuration options.</p>
<ul>
<li>
<p><b>dfs.nameservices</b> - the logical name for this new nameservice</p>
<p>Choose a logical name for this nameservice, for example &#x201c;mycluster&#x201d;, and use this logical name for the value of this config option. The name you choose is arbitrary. It will be used both for configuration and as the authority component of absolute HDFS paths in the cluster.</p>
<p><b>Note:</b> If you are also using HDFS Federation, this configuration setting should also include the list of other nameservices, HA or otherwise, as a comma-separated list.</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;dfs.nameservices&lt;/name&gt;
&lt;value&gt;mycluster&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
</li>
<li>
<p><b>dfs.ha.namenodes.[nameservice ID]</b> - unique identifiers for each NameNode in the nameservice</p>
<p>Configure with a list of comma-separated NameNode IDs. This will be used by DataNodes to determine all the NameNodes in the cluster. For example, if you used &#x201c;mycluster&#x201d; as the nameservice ID previously, and you wanted to use &#x201c;nn1&#x201d;,&#x201c;nn2&#x201d; and &#x201c;nn3&#x201d; as the individual IDs of the NameNodes, you would configure this as such:</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;dfs.ha.namenodes.mycluster&lt;/name&gt;
&lt;value&gt;nn1,nn2,nn3&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p><b>Note:</b> The minimum number of NameNodes for HA is two, but you can configure more. Its suggested to not exceed 5 - with a recommended 3 NameNodes - due to communication overheads.</p>
</li>
<li>
<p><b>dfs.namenode.rpc-address.[nameservice ID].[name node ID]</b> - the fully-qualified RPC address for each NameNode to listen on</p>
<p>For both of the previously-configured NameNode IDs, set the full address and IPC port of the NameNode process. Note that this results in two separate configuration options. For example:</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;dfs.namenode.rpc-address.mycluster.nn1&lt;/name&gt;
&lt;value&gt;machine1.example.com:8020&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;dfs.namenode.rpc-address.mycluster.nn2&lt;/name&gt;
&lt;value&gt;machine2.example.com:8020&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;dfs.namenode.rpc-address.mycluster.nn3&lt;/name&gt;
&lt;value&gt;machine3.example.com:8020&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p><b>Note:</b> You may similarly configure the &#x201c;<b>servicerpc-address</b>&#x201d; setting if you so desire.</p>
</li>
<li>
<p><b>dfs.namenode.http-address.[nameservice ID].[name node ID]</b> - the fully-qualified HTTP address for each NameNode to listen on</p>
<p>Similarly to <i>rpc-address</i> above, set the addresses for both NameNodes&#x2019; HTTP servers to listen on. For example:</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;dfs.namenode.http-address.mycluster.nn1&lt;/name&gt;
&lt;value&gt;machine1.example.com:9870&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;dfs.namenode.http-address.mycluster.nn2&lt;/name&gt;
&lt;value&gt;machine2.example.com:9870&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;dfs.namenode.http-address.mycluster.nn3&lt;/name&gt;
&lt;value&gt;machine3.example.com:9870&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p><b>Note:</b> If you have Hadoop&#x2019;s security features enabled, you should also set the <i>https-address</i> similarly for each NameNode.</p>
</li>
<li>
<p><b>dfs.namenode.shared.edits.dir</b> - the location of the shared storage directory</p>
<p>This is where one configures the path to the remote shared edits directory which the Standby NameNodes use to stay up-to-date with all the file system changes the Active NameNode makes. <b>You should only configure one of these directories.</b> This directory should be mounted r/w on the NameNode machines. The value of this setting should be the absolute path to this directory on the NameNode machines. For example:</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;dfs.namenode.shared.edits.dir&lt;/name&gt;
&lt;value&gt;file:///mnt/filer1/dfs/ha-name-dir-shared&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
</li>
<li>
<p><b>dfs.client.failover.proxy.provider.[nameservice ID]</b> - the Java class that HDFS clients use to contact the Active NameNode</p>
<p>Configure the name of the Java class which will be used by the DFS Client to determine which NameNode is the current Active, and therefore which NameNode is currently serving client requests. The two implementations which currently ship with Hadoop are the <b>ConfiguredFailoverProxyProvider</b> and the <b>RequestHedgingProxyProvider</b> (which, for the first call, concurrently invokes all namenodes to determine the active one, and on subsequent requests, invokes the active namenode until a fail-over happens), so use one of these unless you are using a custom proxy provider.</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;dfs.client.failover.proxy.provider.mycluster&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
</li>
<li>
<p><b>dfs.ha.fencing.methods</b> - a list of scripts or Java classes which will be used to fence the Active NameNode during a failover</p>
<p>It is critical for correctness of the system that only one NameNode be in the Active state at any given time. Thus, during a failover, we first ensure that the Active NameNode is either in the Standby state, or the process has terminated, before transitioning another NameNode to the Active state. In order to do this, you must configure at least one <b>fencing method.</b> These are configured as a carriage-return-separated list, which will be attempted in order until one indicates that fencing has succeeded. There are three methods which ship with Hadoop: <i>shell</i>, <i>sshfence</i> and <i>powershell</i>. For information on implementing your own custom fencing method, see the <i>org.apache.hadoop.ha.NodeFencer</i> class.</p><hr />
<p><b>sshfence</b> - SSH to the Active NameNode and kill the process</p>
<p>The <i>sshfence</i> option SSHes to the target node and uses <i>fuser</i> to kill the process listening on the service&#x2019;s TCP port. In order for this fencing option to work, it must be able to SSH to the target node without providing a passphrase. Thus, one must also configure the <b>dfs.ha.fencing.ssh.private-key-files</b> option, which is a comma-separated list of SSH private key files. For example:</p>
<div class="source">
<div class="source">
<pre> &lt;property&gt;
&lt;name&gt;dfs.ha.fencing.methods&lt;/name&gt;
&lt;value&gt;sshfence&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;dfs.ha.fencing.ssh.private-key-files&lt;/name&gt;
&lt;value&gt;/home/exampleuser/.ssh/id_rsa&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>Optionally, one may configure a non-standard username or port to perform the SSH. One may also configure a timeout, in milliseconds, for the SSH, after which this fencing method will be considered to have failed. It may be configured like so:</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;dfs.ha.fencing.methods&lt;/name&gt;
&lt;value&gt;sshfence([[username][:port]])&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;dfs.ha.fencing.ssh.connect-timeout&lt;/name&gt;
&lt;value&gt;30000&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<hr />
<p><b>shell</b> - run an arbitrary shell command to fence the Active NameNode</p>
<p>The <i>shell</i> fencing method runs an arbitrary shell command. It may be configured like so:</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;dfs.ha.fencing.methods&lt;/name&gt;
&lt;value&gt;shell(/path/to/my/script.sh arg1 arg2 ...)&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>The string between &#x2018;(&#x2019; and &#x2018;)&#x2019; is passed directly to a bash shell and may not include any closing parentheses.</p>
<p>The shell command will be run with an environment set up to contain all of the current Hadoop configuration variables, with the &#x2018;_&#x2019; character replacing any &#x2018;.&#x2019; characters in the configuration keys. The configuration used has already had any namenode-specific configurations promoted to their generic forms &#x2013; for example <b>dfs_namenode_rpc-address</b> will contain the RPC address of the target node, even though the configuration may specify that variable as <b>dfs.namenode.rpc-address.ns1.nn1</b>.</p>
<p>Additionally, the following variables referring to the target node to be fenced are also available:</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th align="left"> </th>
<th align="left"> </th></tr>
</thead><tbody>
<tr class="b">
<td align="left"> $target_host </td>
<td align="left"> hostname of the node to be fenced </td></tr>
<tr class="a">
<td align="left"> $target_port </td>
<td align="left"> IPC port of the node to be fenced </td></tr>
<tr class="b">
<td align="left"> $target_address </td>
<td align="left"> the above two, combined as host:port </td></tr>
<tr class="a">
<td align="left"> $target_nameserviceid </td>
<td align="left"> the nameservice ID of the NN to be fenced </td></tr>
<tr class="b">
<td align="left"> $target_namenodeid </td>
<td align="left"> the namenode ID of the NN to be fenced </td></tr>
</tbody>
</table>
<p>These environment variables may also be used as substitutions in the shell command itself. For example:</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;dfs.ha.fencing.methods&lt;/name&gt;
&lt;value&gt;shell(/path/to/my/script.sh --nameservice=$target_nameserviceid $target_host:$target_port)&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>If the shell command returns an exit code of 0, the fencing is determined to be successful. If it returns any other exit code, the fencing was not successful and the next fencing method in the list will be attempted.</p>
<p><b>Note:</b> This fencing method does not implement any timeout. If timeouts are necessary, they should be implemented in the shell script itself (eg by forking a subshell to kill its parent in some number of seconds).</p><hr />
<p><b>powershell</b> - use PowerShell to remotely connect to a machine and kill the required process</p>
<p>The <i>powershell</i> fencing method uses PowerShell command. It may be configured like so:</p>
<div class="source">
<div class="source">
<pre> &lt;property&gt;
&lt;name&gt;dfs.ha.fencing.methods&lt;/name&gt;
&lt;value&gt;powershell(NameNode)&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>The argument passed to this fencer should be a unique string in the &#x201c;CommandLine&#x201d; attribute for the &#x201c;java.exe&#x201d; process. For example, the full path for the Namenode: &#x201c;org.apache.hadoop.hdfs.server.namenode.NameNode&#x201d;. The administrator can also shorten the name to &#x201c;Namenode&#x201d; if it&#x2019;s unique.</p>
<p><b>Note:</b> This only works in Windows.</p><hr />
</li>
<li>
<p><b>fs.defaultFS</b> - the default path prefix used by the Hadoop FS client when none is given</p>
<p>Optionally, you may now configure the default path for Hadoop clients to use the new HA-enabled logical URI. If you used &#x201c;mycluster&#x201d; as the nameservice ID earlier, this will be the value of the authority portion of all of your HDFS paths. This may be configured like so, in your <b>core-site.xml</b> file:</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;fs.defaultFS&lt;/name&gt;
&lt;value&gt;hdfs://mycluster&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
</li>
<li>
<p><b>dfs.ha.nn.not-become-active-in-safemode</b> - if prevent safe mode namenodes to become active or observer</p>
<p>Whether allow namenode to become active when it is in safemode, when it is set to true, namenode in safemode will report SERVICE_UNHEALTHY to ZKFC if auto failover is on, or will throw exception to fail the transition to active if auto failover is off. If you transition namenode to observer state when it is in safemode, when this configuration is set to true, namenode will throw exception to fail the transition to observer. For example:</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;dfs.ha.nn.not-become-active-in-safemode&lt;/name&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
</li>
</ul></section><section>
<h3><a name="Deployment_details"></a>Deployment details</h3>
<p>After all of the necessary configuration options have been set, one must initially synchronize the two HA NameNodes&#x2019; on-disk metadata.</p>
<ul>
<li>
<p>If you are setting up a fresh HDFS cluster, you should first run the format command (<i>hdfs namenode -format</i>) on one of NameNodes.</p>
</li>
<li>
<p>If you have already formatted the NameNode, or are converting a non-HA-enabled cluster to be HA-enabled, you should now copy over the contents of your NameNode metadata directories to the other, unformatted NameNodes by running the command &#x201c;<i>hdfs namenode -bootstrapStandby</i>&#x201d; on the unformatted NameNode. Running this command will also ensure that the shared edits directory (as configured by <b>dfs.namenode.shared.edits.dir</b>) contains sufficient edits transactions to be able to start both NameNodes.</p>
</li>
<li>
<p>If you are converting a non-HA NameNode to be HA, you should run the command &#x201c;<i>hdfs -initializeSharedEdits</i>&#x201d;, which will initialize the shared edits directory with the edits data from the local NameNode edits directories.</p>
</li>
</ul>
<p>At this point you may start all of your HA NameNodes as you normally would start a NameNode.</p>
<p>You can visit each of the NameNodes&#x2019; web pages separately by browsing to their configured HTTP addresses. You should notice that next to the configured address will be the HA state of the NameNode (either &#x201c;standby&#x201d; or &#x201c;active&#x201d;.) Whenever an HA NameNode starts, it is initially in the Standby state.</p></section><section>
<h3><a name="Administrative_commands"></a>Administrative commands</h3>
<p>Now that your HA NameNodes are configured and started, you will have access to some additional commands to administer your HA HDFS cluster. Specifically, you should familiarize yourself with all of the subcommands of the &#x201c;<i>hdfs haadmin</i>&#x201d; command. Running this command without any additional arguments will display the following usage information:</p>
<div class="source">
<div class="source">
<pre>Usage: DFSHAAdmin [-ns &lt;nameserviceId&gt;]
[-transitionToActive &lt;serviceId&gt;]
[-transitionToStandby &lt;serviceId&gt;]
[-failover [--forcefence] [--forceactive] &lt;serviceId&gt; &lt;serviceId&gt;]
[-getServiceState &lt;serviceId&gt;]
[-getAllServiceState]
[-checkHealth &lt;serviceId&gt;]
[-help &lt;command&gt;]
</pre></div></div>
<p>This guide describes high-level uses of each of these subcommands. For specific usage information of each subcommand, you should run &#x201c;<i>hdfs haadmin -help &lt;command</i>&gt;&#x201d;.</p>
<ul>
<li>
<p><b>transitionToActive</b> and <b>transitionToStandby</b> - transition the state of the given NameNode to Active or Standby</p>
<p>These subcommands cause a given NameNode to transition to the Active or Standby state, respectively. <b>These commands do not attempt to perform any fencing, and thus should rarely be used.</b> Instead, one should almost always prefer to use the &#x201c;<i>hdfs haadmin -failover</i>&#x201d; subcommand.</p>
</li>
<li>
<p><b>failover</b> - initiate a failover between two NameNodes</p>
<p>This subcommand causes a failover from the first provided NameNode to the second. If the first NameNode is in the Standby state, this command simply transitions the second to the Active state without error. If the first NameNode is in the Active state, an attempt will be made to gracefully transition it to the Standby state. If this fails, the fencing methods (as configured by <b>dfs.ha.fencing.methods</b>) will be attempted in order until one succeeds. Only after this process will the second NameNode be transitioned to the Active state. If no fencing method succeeds, the second NameNode will not be transitioned to the Active state, and an error will be returned.</p>
</li>
<li>
<p><b>getServiceState</b> - determine whether the given NameNode is Active or Standby</p>
<p>Connect to the provided NameNode to determine its current state, printing either &#x201c;standby&#x201d; or &#x201c;active&#x201d; to STDOUT appropriately. This subcommand might be used by cron jobs or monitoring scripts which need to behave differently based on whether the NameNode is currently Active or Standby.</p>
</li>
<li>
<p><b>getAllServiceState</b> - returns the state of all the NameNodes</p>
<p>Connect to the configured NameNodes to determine the current state, print either &#x201c;standby&#x201d; or &#x201c;active&#x201d; to STDOUT appropriately.</p>
</li>
<li>
<p><b>checkHealth</b> - check the health of the given NameNode</p>
<p>Connect to the provided NameNode to check its health. The NameNode is capable of performing some diagnostics on itself, including checking if internal services are running as expected. This command will return 0 if the NameNode is healthy, non-zero otherwise. One might use this command for monitoring purposes.</p>
<p><b>Note:</b> This is not yet implemented, and at present will always return success, unless the given NameNode is completely down.</p>
</li>
</ul></section></section><section>
<h2><a name="Automatic_Failover"></a>Automatic Failover</h2><section>
<h3><a name="Introduction"></a>Introduction</h3>
<p>The above sections describe how to configure manual failover. In that mode, the system will not automatically trigger a failover from the active to the standby NameNode, even if the active node has failed. This section describes how to configure and deploy automatic failover.</p></section><section>
<h3><a name="Components"></a>Components</h3>
<p>Automatic failover adds two new components to an HDFS deployment: a ZooKeeper quorum, and the ZKFailoverController process (abbreviated as ZKFC).</p>
<p>Apache ZooKeeper is a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures. The implementation of automatic HDFS failover relies on ZooKeeper for the following things:</p>
<ul>
<li>
<p><b>Failure detection</b> - each of the NameNode machines in the cluster maintains a persistent session in ZooKeeper. If the machine crashes, the ZooKeeper session will expire, notifying the other NameNode that a failover should be triggered.</p>
</li>
<li>
<p><b>Active NameNode election</b> - ZooKeeper provides a simple mechanism to exclusively elect a node as active. If the current active NameNode crashes, another node may take a special exclusive lock in ZooKeeper indicating that it should become the next active.</p>
</li>
</ul>
<p>The ZKFailoverController (ZKFC) is a new component which is a ZooKeeper client which also monitors and manages the state of the NameNode. Each of the machines which runs a NameNode also runs a ZKFC, and that ZKFC is responsible for:</p>
<ul>
<li>
<p><b>Health monitoring</b> - the ZKFC pings its local NameNode on a periodic basis with a health-check command. So long as the NameNode responds in a timely fashion with a healthy status, the ZKFC considers the node healthy. If the node has crashed, frozen, or otherwise entered an unhealthy state, the health monitor will mark it as unhealthy.</p>
</li>
<li>
<p><b>ZooKeeper session management</b> - when the local NameNode is healthy, the ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it also holds a special &#x201c;lock&#x201d; znode. This lock uses ZooKeeper&#x2019;s support for &#x201c;ephemeral&#x201d; nodes; if the session expires, the lock node will be automatically deleted.</p>
</li>
<li>
<p><b>ZooKeeper-based election</b> - if the local NameNode is healthy, and the ZKFC sees that no other node currently holds the lock znode, it will itself try to acquire the lock. If it succeeds, then it has &#x201c;won the election&#x201d;, and is responsible for running a failover to make its local NameNode active. The failover process is similar to the manual failover described above: first, the previous active is fenced if necessary, and then the local NameNode transitions to active state.</p>
</li>
</ul>
<p>For more details on the design of automatic failover, refer to the design document attached to HDFS-2185 on the Apache HDFS JIRA.</p></section><section>
<h3><a name="Deploying_ZooKeeper"></a>Deploying ZooKeeper</h3>
<p>In a typical deployment, ZooKeeper daemons are configured to run on three or five nodes. Since ZooKeeper itself has light resource requirements, it is acceptable to collocate the ZooKeeper nodes on the same hardware as the HDFS NameNode and Standby Node. Many operators choose to deploy the third ZooKeeper process on the same node as the YARN ResourceManager. It is advisable to configure the ZooKeeper nodes to store their data on separate disk drives from the HDFS metadata for best performance and isolation.</p>
<p>The setup of ZooKeeper is out of scope for this document. We will assume that you have set up a ZooKeeper cluster running on three or more nodes, and have verified its correct operation by connecting using the ZK CLI.</p></section><section>
<h3><a name="Before_you_begin"></a>Before you begin</h3>
<p>Before you begin configuring automatic failover, you should shut down your cluster. It is not currently possible to transition from a manual failover setup to an automatic failover setup while the cluster is running.</p></section><section>
<h3><a name="Configuring_automatic_failover"></a>Configuring automatic failover</h3>
<p>The configuration of automatic failover requires the addition of two new parameters to your configuration. In your <code>hdfs-site.xml</code> file, add:</p>
<div class="source">
<div class="source">
<pre> &lt;property&gt;
&lt;name&gt;dfs.ha.automatic-failover.enabled&lt;/name&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>This specifies that the cluster should be set up for automatic failover. In your <code>core-site.xml</code> file, add:</p>
<div class="source">
<div class="source">
<pre> &lt;property&gt;
&lt;name&gt;ha.zookeeper.quorum&lt;/name&gt;
&lt;value&gt;zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>This lists the host-port pairs running the ZooKeeper service.</p>
<p>As with the parameters described earlier in the document, these settings may be configured on a per-nameservice basis by suffixing the configuration key with the nameservice ID. For example, in a cluster with federation enabled, you can explicitly enable automatic failover for only one of the nameservices by setting <code>dfs.ha.automatic-failover.enabled.my-nameservice-id</code>.</p>
<p>There are also several other configuration parameters which may be set to control the behavior of automatic failover; however, they are not necessary for most installations. Please refer to the configuration key specific documentation for details.</p></section><section>
<h3><a name="Initializing_HA_state_in_ZooKeeper"></a>Initializing HA state in ZooKeeper</h3>
<p>After the configuration keys have been added, the next step is to initialize required state in ZooKeeper. You can do so by running the following command from one of the NameNode hosts.</p>
<div class="source">
<div class="source">
<pre>[hdfs]$ $HADOOP_HOME/bin/zkfc -formatZK
</pre></div></div>
<p>This will create a znode in ZooKeeper inside of which the automatic failover system stores its data.</p></section><section>
<h3><a name="Starting_the_cluster_with_start-dfs.sh"></a>Starting the cluster with <code>start-dfs.sh</code></h3>
<p>Since automatic failover has been enabled in the configuration, the <code>start-dfs.sh</code> script will now automatically start a ZKFC daemon on any machine that runs a NameNode. When the ZKFCs start, they will automatically select one of the NameNodes to become active.</p></section><section>
<h3><a name="Starting_the_cluster_manually"></a>Starting the cluster manually</h3>
<p>If you manually manage the services on your cluster, you will need to manually start the <code>zkfc</code> daemon on each of the machines that runs a NameNode. You can start the daemon by running:</p>
<div class="source">
<div class="source">
<pre>[hdfs]$ $HADOOP_HOME/bin/hdfs --daemon start zkfc
</pre></div></div>
</section><section>
<h3><a name="Securing_access_to_ZooKeeper"></a>Securing access to ZooKeeper</h3>
<p>If you are running a secure cluster, you will likely want to ensure that the information stored in ZooKeeper is also secured. This prevents malicious clients from modifying the metadata in ZooKeeper or potentially triggering a false failover.</p>
<p>In order to secure the information in ZooKeeper, first add the following to your <code>core-site.xml</code> file:</p>
<div class="source">
<div class="source">
<pre> &lt;property&gt;
&lt;name&gt;ha.zookeeper.auth&lt;/name&gt;
&lt;value&gt;@/path/to/zk-auth.txt&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;ha.zookeeper.acl&lt;/name&gt;
&lt;value&gt;@/path/to/zk-acl.txt&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>Please note the &#x2018;@&#x2019; character in these values &#x2013; this specifies that the configurations are not inline, but rather point to a file on disk. The authentication info may also be read via a CredentialProvider (pls see the CredentialProviderAPI Guide in the hadoop-common project).</p>
<p>The first configured file specifies a list of ZooKeeper authentications, in the same format as used by the ZK CLI. For example, you may specify something like:</p>
<div class="source">
<div class="source">
<pre>digest:hdfs-zkfcs:mypassword
</pre></div></div>
<p>&#x2026;where <code>hdfs-zkfcs</code> is a unique username for ZooKeeper, and <code>mypassword</code> is some unique string used as a password.</p>
<p>Next, generate a ZooKeeper ACL that corresponds to this authentication, using a command like the following:</p>
<div class="source">
<div class="source">
<pre>[hdfs]$ java -cp $ZK_HOME/lib/*:$ZK_HOME/zookeeper-3.4.2.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider hdfs-zkfcs:mypassword
output: hdfs-zkfcs:mypassword-&gt;hdfs-zkfcs:P/OQvnYyU/nF/mGYvB/xurX8dYs=
</pre></div></div>
<p>Copy and paste the section of this output after the &#x2018;-&gt;&#x2019; string into the file <code>zk-acls.txt</code>, prefixed by the string &#x201c;<code>digest:</code>&#x201d;. For example:</p>
<div class="source">
<div class="source">
<pre>digest:hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=:rwcda
</pre></div></div>
<p>In order for these ACLs to take effect, you should then rerun the <code>zkfc -formatZK</code> command as described above.</p>
<p>After doing so, you may verify the ACLs from the ZK CLI as follows:</p>
<div class="source">
<div class="source">
<pre>[zk: localhost:2181(CONNECTED) 1] getAcl /hadoop-ha
'digest,'hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=
: cdrwa
</pre></div></div>
</section><section>
<h3><a name="Verifying_automatic_failover"></a>Verifying automatic failover</h3>
<p>Once automatic failover has been set up, you should test its operation. To do so, first locate the active NameNode. You can tell which node is active by visiting the NameNode web interfaces &#x2013; each node reports its HA state at the top of the page.</p>
<p>Once you have located your active NameNode, you may cause a failure on that node. For example, you can use <code>kill -9 &lt;pid of NN</code>&gt; to simulate a JVM crash. Or, you could power cycle the machine or unplug its network interface to simulate a different kind of outage. After triggering the outage you wish to test, the other NameNode should automatically become active within several seconds. The amount of time required to detect a failure and trigger a fail-over depends on the configuration of <code>ha.zookeeper.session-timeout.ms</code>, but defaults to 5 seconds.</p>
<p>If the test does not succeed, you may have a misconfiguration. Check the logs for the <code>zkfc</code> daemons as well as the NameNode daemons in order to further diagnose the issue.</p></section></section><section>
<h2><a name="Automatic_Failover_FAQ"></a>Automatic Failover FAQ</h2>
<ul>
<li>
<p><b>Is it important that I start the ZKFC and NameNode daemons in any particular order?</b></p>
<p>No. On any given node you may start the ZKFC before or after its corresponding NameNode.</p>
</li>
<li>
<p><b>What additional monitoring should I put in place?</b></p>
<p>You should add monitoring on each host that runs a NameNode to ensure that the ZKFC remains running. In some types of ZooKeeper failures, for example, the ZKFC may unexpectedly exit, and should be restarted to ensure that the system is ready for automatic failover.</p>
<p>Additionally, you should monitor each of the servers in the ZooKeeper quorum. If ZooKeeper crashes, then automatic failover will not function.</p>
</li>
<li>
<p><b>What happens if ZooKeeper goes down?</b></p>
<p>If the ZooKeeper cluster crashes, no automatic failovers will be triggered. However, HDFS will continue to run without any impact. When ZooKeeper is restarted, HDFS will reconnect with no issues.</p>
</li>
<li>
<p><b>Can I designate one of my NameNodes as primary/preferred?</b></p>
<p>No. Currently, this is not supported. Whichever NameNode is started first will become active. You may choose to start the cluster in a specific order such that your preferred node starts first.</p>
</li>
<li>
<p><b>How can I initiate a manual failover when automatic failover is configured?</b></p>
<p>Even if automatic failover is configured, you may initiate a manual failover using the same <code>hdfs haadmin</code> command. It will perform a coordinated failover.</p>
</li>
</ul></section>
</div>
</div>
<div class="clear">
<hr/>
</div>
<div id="footer">
<div class="xright">
&#169; 2008-2023
Apache Software Foundation
- <a href="http://maven.apache.org/privacy-policy.html">Privacy Policy</a>.
Apache Maven, Maven, Apache, the Apache feather logo, and the Apache Maven project logos are trademarks of The Apache Software Foundation.
</div>
<div class="clear">
<hr/>
</div>
</div>
</body>
</html>