HADOOP-14742. Document multi-URI replication Inode for ViewFS. Contributed by Gera Shegalov

(cherry picked from commit ddb67ca707)
2018-03-12 13:42:38 -07:00 · 2018-03-12 13:42:38 -07:00 · c9364b3bce
parent 2bda1ffe72
commit c9364b3bce
1 changed files with 139 additions and 0 deletions
--- a/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/ViewFs.md
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/ViewFs.md
@ -180,6 +180,145 @@ Recall that one cannot rename files or directories across namenodes or clusters
 This will NOT work in the new world if `/user` and `/data` are actually stored on different namenodes within a cluster.
 Multi-Filesystem I/0 with Nfly Mount Points
 -----------------
 HDFS and other distributed filesystems provide data resilience via some sort of
 redundancy such as block replication or more sophisticated distributed encoding.
 However, modern setups may be comprised of multiple Hadoop clusters, enterprise
 filers, hosted on and off premise. Nfly mount points make it possible for a
 single logical file to be synchronously replicated by multiple filesystems.
 It's designed for a relatively small files up to a gigabyte. In general it's a
 function of a single core/single network link performance since the logic
 resides in a single client JVM using ViewFs such as FsShell or a
 MapReduce task.
 ### Basic Configuration
 Consider the following example to understand the basic configuration of Nfly.
 Suppose we want to keep the directory `ads` replicated on three filesystems
 represented by URIs: `uri1`, `uri2` and `uri3`.
 ```xml
  <property>
    <name>fs.viewfs.mounttable.global.linkNfly../ads</name>
    <value>uri1,uri2,uri3</value>
  </property>
 ```
 Note 2 consecutive `..` in the property name. They arise because of empty
 settings for advanced tweaking of the mount point which we will show in
 subsequent sections. The property value is a comma-separated list of URIs.
 URIs may point to different clusters in different regions
 `hdfs://datacenter-east/ads`, `s3a://models-us-west/ads`, `hdfs://datacenter-west/ads`
 or in the simplest case to different directories under the same filesystem,
 e.g., `file:/tmp/ads1`, `file:/tmp/ads2`, `file:/tmp/ads3`
 All *modifications* performed under the global path `viewfs://global/ads` are
 propagated to all destination URIs if the underlying system is available.
 For instance if we create a file via hadoop shell
 ```bash
 hadoop fs -touchz viewfs://global/ads/z1
 ```
 We will find it via local filesystem in the latter configuration
 ```bash
 ls -al /tmp/ads*/z1
 -rw-r--r--  1 user  wheel  0 Mar 11 12:17 /tmp/ads1/z1
 -rw-r--r--  1 user  wheel  0 Mar 11 12:17 /tmp/ads2/z1
 -rw-r--r--  1 user  wheel  0 Mar 11 12:17 /tmp/ads3/z1
 ```
 A read from the global path is processed by the first filesystem that does not
 result in an exception. The order in which filesystems are accessed depends on
 whether they are available at this moment or and whether a topological order
 exists.
 ### Advanced Configuration
 Mount points `linkNfly` can be further configured using parameters passed as a
 comma-separated list of key=value pairs. Following parameters are currently
 supported.
 `minReplication=int` determines the minimum number of destinations that have to
 process a write modification without exceptions, if below nfly write is failed.
 It is an configuration error to have minReplication higher than the number of
 target URIs. The default is 2.
 If minReplication is lower than the number of target URIs we may have some
 target URIs without latest writes. It can be compensated by employing more
 expensive read operations controlled by the following settings
 `readMostRecent=boolean` if set to `true` causes Nfly client to check the path
 under all target URIs instead of just the first one based on the topology order.
 Among all available at the moment the one with the most recent modification time
 is processed.
 `repairOnRead=boolean` if set to `true` causes Nfly to copy most recent replica
 to stale targets such that subsequent reads can be done cheaply again from the
 closest replica.
 ### Network Topology
 Nfly seeks to satisfy reads from the "closest" target URI.
 To this end, Nfly extends the notion of
 <a href="hadoop-project-dist/hadoop-common/RackAwareness.html">Rack Awareness</a>
 to the authorities of target URIs.
 Nfly applies NetworkTopology to resolve authorities of the URIs. Most commonly
 a script based mapping is used in a heterogeneous setup. We could have a script
 providing the following topology mapping
 | URI                           | Topology                 |
 |-------------------------------|------------------------- |
 | `hdfs://datacenter-east/ads`  | /us-east/onpremise-hdfs  |
 | `s3a://models-us-west/ads`    | /us-west/aws             |
 | `hdfs://datacenter-west/ads`  | /us-west/onpremise-hdfs  |
 If a target URI does not have the authority part as in `file:/` Nfly injects
 client's local node name.
 ### Example Nfly Configuration
 ```xml
  <property>
    <name>fs.viewfs.mounttable.global.linkNfly.minReplication=3,readMostRecent=true,repairOnRead=false./ads</name>
    <value>hdfs://datacenter-east/ads,hdfs://datacenter-west/ads,s3a://models-us-west/ads,file:/tmp/ads</value>
  </property>
 ```
 ### How Nfly File Creation works
 ```java
 FileSystem fs = FileSystem.get("viewfs://global/", ...);
 FSDataOutputStream out = fs.create("viewfs://global/ads/f1");
 out.write(...);
 out.close();
 ```
 The code above would result in the following execution.
 1. create an invisible file `_nfly_tmp_f1` under each target URI i.e.,
 `hdfs://datacenter-east/ads/_nfly_tmp_f1`, `hdfs://datacenter-west/ads/_nfly_tmp_f1`, etc.
 This is done by calling `create` on underlying filesystems and returns a
 `FSDataOutputStream` object `out` that wraps all four output streams.
 2. Thus each subsequent write on `out` can be forwarded to each wrapped stream.
 3. On `out.close` all streams are closed, and the files are renamed from `_nfly_tmp_f1` to `f1`.
 All files receive the same *modification time* corresponding to the client
 system time as of beginning of this step.
 4. If at least `minReplication` destinations have gone through steps 1-3 without
 failures Nfly considers the transaction logically committed; Otherwise it tries
 to clean up the temporary files in a best-effort attempt.
 Note that because 4 is a best-effort step and the client JVM could crash and never
 resume its work, it's a good idea to provision some sort of cron job to purge such
 `_nfly_tmp` files.
 ### FAQ
 1.  **As I move from non-federated world to the federated world, I will have to keep track of namenodes for different volumes; how do I do that?**