From c9364b3bce30946a6c0154974d0adc6accb3bea3 Mon Sep 17 00:00:00 2001
From: Chris Douglas
Date: Mon, 12 Mar 2018 13:42:38 -0700
Subject: [PATCH] HADOOP-14742. Document multi-URI replication Inode for
 ViewFS. Contributed by Gera Shegalov

(cherry picked from commit ddb67ca707de896cd0ba5cda3c0d1a2d9edca968)
---
 .../hadoop-hdfs/src/site/markdown/ViewFs.md   | 139 ++++++++++++++++++
 1 file changed, 139 insertions(+)

diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/ViewFs.md b/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/ViewFs.md
index 10085832566..f851ef6a656 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/ViewFs.md
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/ViewFs.md
@@ -180,6 +180,145 @@ Recall that one cannot rename files or directories across namenodes or clusters
 This will NOT work in the new world if `/user` and `/data` are actually stored
 on different namenodes within a cluster.
 
+Multi-Filesystem I/O with Nfly Mount Points
+-----------------
+
+HDFS and other distributed filesystems provide data resilience via some form of
+redundancy such as block replication or more sophisticated distributed
+encoding. However, modern setups may comprise multiple Hadoop clusters and
+enterprise filers hosted on and off premises. Nfly mount points make it
+possible for a single logical file to be synchronously replicated by multiple
+filesystems. Nfly is designed for relatively small files, up to a gigabyte
+each. In general, throughput is a function of a single core and a single
+network link, since the replication logic resides in a single client JVM using
+ViewFs, such as FsShell or a MapReduce task.
+
+### Basic Configuration
+
+Consider the following example to understand the basic configuration of Nfly.
+Suppose we want to keep the directory `ads` replicated on three filesystems
+represented by the URIs `uri1`, `uri2`, and `uri3`.
+
+```xml
+<property>
+    <name>fs.viewfs.mounttable.global.linkNfly../ads</name>
+    <value>uri1,uri2,uri3</value>
+</property>
+```
+Note the two consecutive dots `..` in the property name. They arise from empty
+settings for advanced tweaking of the mount point, which we show in subsequent
+sections. The property value is a comma-separated list of URIs.
+
+The URIs may point to different clusters in different regions, e.g.,
+`hdfs://datacenter-east/ads`, `s3a://models-us-west/ads`,
+`hdfs://datacenter-west/ads`, or in the simplest case to different directories
+under the same filesystem, e.g., `file:/tmp/ads1`, `file:/tmp/ads2`,
+`file:/tmp/ads3`.
+
+All *modifications* performed under the global path `viewfs://global/ads` are
+propagated to all destination URIs, provided the underlying filesystems are
+available.
+
+For instance, if we create a file via the Hadoop shell
+```bash
+hadoop fs -touchz viewfs://global/ads/z1
+```
+
+we will find it via the local filesystem in the latter configuration
+```bash
+ls -al /tmp/ads*/z1
+-rw-r--r--  1 user  wheel  0 Mar 11 12:17 /tmp/ads1/z1
+-rw-r--r--  1 user  wheel  0 Mar 11 12:17 /tmp/ads2/z1
+-rw-r--r--  1 user  wheel  0 Mar 11 12:17 /tmp/ads3/z1
+```
+
+A read from the global path is processed by the first filesystem that does not
+result in an exception. The order in which the filesystems are accessed depends
+on whether they are available at the moment and whether a topological order
+exists.
+
+### Advanced Configuration
+
+`linkNfly` mount points can be further configured using parameters passed as a
+comma-separated list of key=value pairs. The following parameters are currently
+supported.
+
+`minReplication=int` determines the minimum number of destinations that have to
+process a write modification without exceptions; if fewer succeed, the Nfly
+write fails. It is a configuration error to set minReplication higher than the
+number of target URIs. The default is 2.
+
+If minReplication is lower than the number of target URIs, some target URIs may
+miss the latest writes.
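The minReplication accounting described above can be sketched in plain Java.
This is a hypothetical illustration of the rule, not code from the Hadoop
implementation; `writeToAll`, `committed`, and the simulated unreachable target
are invented for the example:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.List;

public class NflyWriteSketch {

    // Write the data to every target; count how many accept it without an exception.
    static int writeToAll(List<OutputStream> targets, byte[] data) {
        int successes = 0;
        for (OutputStream out : targets) {
            try {
                out.write(data);
                successes++;
            } catch (IOException e) {
                // One failed target does not abort writes to the others.
            }
        }
        return successes;
    }

    // The write is logically committed only if enough targets succeeded.
    static boolean committed(int successes, int minReplication) {
        return successes >= minReplication;
    }

    public static void main(String[] args) {
        OutputStream up1 = new ByteArrayOutputStream();
        OutputStream up2 = new ByteArrayOutputStream();
        // Simulates an unreachable target: every write throws.
        OutputStream down = new OutputStream() {
            @Override
            public void write(int b) throws IOException {
                throw new IOException("target unavailable");
            }
        };
        int ok = writeToAll(Arrays.asList(up1, up2, down), "z1".getBytes());
        System.out.println("successful targets: " + ok);                            // 2
        System.out.println("committed (minReplication=2): " + committed(ok, 2));    // true
        System.out.println("committed (minReplication=3): " + committed(ok, 3));    // false
    }
}
```

With three targets, one of which is down, the write commits under the default
minReplication of 2 but would fail if minReplication were raised to 3.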
This can be compensated for by employing more expensive read
+operations controlled by the following settings:
+
+`readMostRecent=boolean`, if set to `true`, causes the Nfly client to check the
+path under all target URIs instead of just the first one in topology order.
+Among the replicas available at the moment, the one with the most recent
+modification time is processed.
+
+`repairOnRead=boolean`, if set to `true`, causes Nfly to copy the most recent
+replica to stale targets so that subsequent reads can again be done cheaply
+from the closest replica.
+
+### Network Topology
+
+Nfly seeks to satisfy reads from the "closest" target URI.
+
+To this end, Nfly extends the notion of Rack Awareness to the authorities of
+target URIs.
+
+Nfly applies NetworkTopology to resolve the authorities of the URIs. Most
+commonly, a script-based mapping is used in a heterogeneous setup. We could
+have a script providing the following topology mapping:
+
+| URI                           | Topology                 |
+|-------------------------------|--------------------------|
+| `hdfs://datacenter-east/ads`  | /us-east/onpremise-hdfs  |
+| `s3a://models-us-west/ads`    | /us-west/aws             |
+| `hdfs://datacenter-west/ads`  | /us-west/onpremise-hdfs  |
+
+If a target URI does not have an authority part, as in `file:/`, Nfly injects
+the client's local node name.
+
+### Example Nfly Configuration
+
+```xml
+<property>
+    <name>fs.viewfs.mounttable.global.linkNfly.minReplication=3,readMostRecent=true,repairOnRead=false./ads</name>
+    <value>hdfs://datacenter-east/ads,hdfs://datacenter-west/ads,s3a://models-us-west/ads,file:/tmp/ads</value>
+</property>
+```
+
+### How Nfly File Creation Works
+
+```java
+FileSystem fs = FileSystem.get("viewfs://global/", ...);
+FSDataOutputStream out = fs.create("viewfs://global/ads/f1");
+out.write(...);
+out.close();
+```
+The code above results in the following execution:
+
+1. Create an invisible file `_nfly_tmp_f1` under each target URI, i.e.,
+`hdfs://datacenter-east/ads/_nfly_tmp_f1`, `hdfs://datacenter-west/ads/_nfly_tmp_f1`, etc.
This is done by calling `create` on the underlying
+filesystems; the call returns an `FSDataOutputStream` object `out` that wraps
+all four output streams.
+
+2. Each subsequent write on `out` is then forwarded to each wrapped stream.
+
+3. On `out.close`, all streams are closed, and the files are renamed from
+`_nfly_tmp_f1` to `f1`. All files receive the same *modification time*
+corresponding to the client system time at the beginning of this step.
+
+4. If at least `minReplication` destinations have gone through steps 1-3
+without failures, Nfly considers the transaction logically committed;
+otherwise it tries to clean up the temporary files in a best-effort attempt.
+
+Note that because step 4 is best-effort and the client JVM could crash and
+never resume its work, it is a good idea to provision some sort of cron job to
+purge such `_nfly_tmp` files.
+
 ### FAQ
 
 1.  **As I move from non-federated world to the federated world, I will have to
     keep track of namenodes for different volumes; how do I do that?**