diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSCommands.md b/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSCommands.md index e5b94d48cb9..855ccda0c5c 100644 --- a/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSCommands.md +++ b/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HDFSCommands.md @@ -439,6 +439,7 @@ Usage: hdfs haadmin -transitionToActive [--forceactive] hdfs haadmin -transitionToStandby + hdfs haadmin -transitionToObserver hdfs haadmin -failover [--forcefence] [--forceactive] hdfs haadmin -getServiceState hdfs haadmin -getAllServiceState @@ -454,6 +455,7 @@ Usage: | `-getAllServiceState` | returns the state of all the NameNodes | | | `-transitionToActive` | transition the state of the given NameNode to Active (Warning: No fencing is done) | | `-transitionToStandby` | transition the state of the given NameNode to Standby (Warning: No fencing is done) | +| `-transitionToObserver` | transition the state of the given NameNode to Observer (Warning: No fencing is done) | | `-help` [cmd] | Displays help for the given command or all commands if none is specified. | See [HDFS HA with NFS](./HDFSHighAvailabilityWithNFS.html#Administrative_commands) or [HDFS HA with QJM](./HDFSHighAvailabilityWithQJM.html#Administrative_commands) for more information on this command. diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/ObserverNameNode.md b/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/ObserverNameNode.md new file mode 100644 index 00000000000..254831532d2 --- /dev/null +++ b/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/ObserverNameNode.md @@ -0,0 +1,173 @@ + + +Consistent Reads from HDFS Observer NameNode +============================================================= + + + +Purpose +-------- + +This guide provides an overview of the HDFS Observer NameNode feature +and how to configure/install it in a typical HA-enabled cluster. For a +detailed technical design overview, please check the doc attached to +HDFS-12943. + +Background +----------- + +In a HA-enabled HDFS cluster (for more information, check +[HDFSHighAvailabilityWithQJM](./HDFSHighAvailabilityWithQJM.md)), there +is a single Active NameNode and one or more Standby NameNode(s). The +Active NameNode is responsible for serving all client requests, while +Standby NameNode just keep the up-to-date information regarding the +namespace, by tailing edit logs from JournalNodes, as well as block +location information, by receiving block reports from all the DataNodes. +One drawback of this architecture is that the Active NameNode could be a +single bottle-neck and be overloaded with client requests, especially in +a busy cluster. + +The Consistent Reads from HDFS Observer NameNode feature addresses the +above by introducing a new type of NameNode called **Observer +NameNode**. Similar to Standby NameNode, Observer NameNode keeps itself +up-to-date regarding the namespace and block location information. +In addition, it also has the ability to serve consistent reads, like +Active NameNode. Since read requests are the majority in a typical +environment, this can help to load balancing the NameNode traffic and +improve overall throughput. + +Architecture +-------------- + +In the new architecture, a HA cluster could consists of namenodes in 3 +different states: active, standby and observer. State transition can +happen between active and standby, standby and observer, but not +directly between active and observer. + +To ensure read-after-write consistency within a single client, a state +ID, which is implemented using transaction ID within NameNode, is +introduced in RPC headers. When a client performs write through Active +NameNode, it updates its state ID using the latest transaction ID from +the NameNode. When performing a subsequent read, the client passes this +state ID to Observe NameNode, which will then check against its own +transaction ID, and will ensure its own transaction ID has caught up +with the request's state ID, before serving the read request. + +Edit log tailing is critical for Observer NameNode as it directly affects +the latency between when a transaction is applied in Active NameNode and +when it is applied in the Observer NameNode. A new edit log tailing +mechanism, named "Edit Tailing Fast-Path", is introduced to +significantly reduce this latency. This is built on top of the existing +in-progress edit log tailing feature, with further improvements such as +RPC-based tailing instead of HTTP, a in-memory cache on the JournalNode, +etc. For more details, please see the design doc attached to HDFS-13150. + +New client-side proxy providers are also introduced. +ObserverReadProxyProvider, which inherits the existing +ConfiguredFailoverProxyProvider, should be used to replace the latter to +enable reads from Observer NameNode. When submitting a client read +request, the proxy provider will first try each Observer NameNode +available in the cluster, and only fall back to Active NameNode if all +of the former failed. Similarly, ObserverReadProxyProviderWithIPFailover +is introduced to replace IPFailoverProxyProvider in a IP failover setup. + +Deployment +----------- + +### Configurations + +To enable consistent reads from Observer NameNode, you'll need to add a +few configurations to your **hdfs-site.xml**: + +* **dfs.ha.tail-edits.in-progress** - to enable fast tailing on + in-progress edit logs. + + This enables fast edit log tailing through in-progress edit logs and + also other mechanisms such as RPC-based edit log fetching, in-memory + cache in JournalNodes, and so on. It is disabled by default, but is + **required to be turned on** for the Observer NameNode feature. + + + dfs.ha.tail-edits.in-progress + true + + +* **dfs.journalnode.edit-cache-size.bytes** - the in-memory cache size, + in bytes, on the JournalNodes. + + This is the size, in bytes, of the in-memory cache for storing edits + on the JournalNode side. The cache is used for serving edits via + RPC-based tailing. This is only effective when + dfs.ha.tail-edits.in-progress is turned on. + + + dfs.journalnode.edit-cache-size.bytes + 1048576 + + +### New administrative command + +A new HA admin command is introduced to transition a Standby NameNode +into observer state: + + haadmin -transitionToObserver + +Note this can only be executed on Standby NameNode. Exception will be +thrown when invoking this on Active NameNode. + +Similarly, existing **transitionToStandby** can also be run on an +Observer NameNode, which transition it to the standby state. + +**NOTE**: the feature for Observer NameNode to participate in failover +is not implemented yet. Therefore, as described in the next section, you +should only use **transitionToObserver** to bring up an observer and put +it outside the ZooKeeper controlled failover group. You should not use +**transitionToStandby** since the host for the Observer NameNode cannot +have ZKFC running. + +### Deployment details + +To enable observer support, first you'll need a HA-enabled HDFS cluster +with more than 2 namenodes. Then, you need to transition Standby +NameNode(s) into the observer state. An minimum setup would be running 3 +namenodes in the cluster, one active, one standby and one observer. For +large HDFS clusters we recommend running two or more Observers depending +on the intensity of read requests and HA requirements. + +Note that currently Observer NameNode doesn't integrate fully when +automatic failover is enabled. If the +**dfs.ha.automatic-failover.enabled** is turned on, you'll also need to +disable ZKFC on the namenode for observer. In addition to that, you'll +also need to add **forcemanual** flag to the **transitionToObserver** +command: + + haadmin -transitionToObserver -forcemanual + +In future, this restriction will be lifted. + +### Client configuration + +Clients who wish to use Observer NameNode for read accesses can +specify the ObserverReadProxyProvider class for proxy provider +implementation, in the client-side **hdfs-site.xml** configuration file: + + + dfs.client.failover.proxy.provider. + org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider + + +Clients who do not wish to use Observer NameNode can still use the +existing ConfiguredFailoverProxyProvider and should not see any behavior +change.