HADOOP-14742. Document multi-URI replication Inode for ViewFS. Contributed by Gera Shegalov

(cherry picked from commit ddb67ca707)
Chris Douglas 2018-03-12 13:42:38 -07:00
parent 2bda1ffe72
commit c9364b3bce

@@ -180,6 +180,145 @@ Recall that one cannot rename files or directories across namenodes or clusters
This will NOT work in the new world if `/user` and `/data` are actually stored on different namenodes within a cluster.
Multi-Filesystem I/O with Nfly Mount Points
-----------------
HDFS and other distributed filesystems provide data resilience via some sort of
redundancy such as block replication or more sophisticated distributed encoding.
However, modern setups may comprise multiple Hadoop clusters and enterprise
filers hosted on and off premise. Nfly mount points make it possible for a
single logical file to be synchronously replicated by multiple filesystems.
It is designed for relatively small files up to a gigabyte. In general,
throughput is bounded by the performance of a single core and a single network
link, since the logic resides in a single client JVM using ViewFs, such as
FsShell or a MapReduce task.
### Basic Configuration
Consider the following example to understand the basic configuration of Nfly.
Suppose we want to keep the directory `ads` replicated on three filesystems
represented by URIs: `uri1`, `uri2` and `uri3`.
```xml
<property>
<name>fs.viewfs.mounttable.global.linkNfly../ads</name>
<value>uri1,uri2,uri3</value>
</property>
```
Note the two consecutive dots `..` in the property name. They arise because the
settings for advanced tweaking of the mount point are left empty; we will show
them in subsequent sections. The property value is a comma-separated list of URIs.
URIs may point to different clusters in different regions, e.g.,
`hdfs://datacenter-east/ads`, `s3a://models-us-west/ads` and `hdfs://datacenter-west/ads`,
or in the simplest case to different directories under the same filesystem,
e.g., `file:/tmp/ads1`, `file:/tmp/ads2` and `file:/tmp/ads3`.
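The simplest case, for example, corresponds to the following mount table entry,
following the same pattern as above:

```xml
<property>
    <name>fs.viewfs.mounttable.global.linkNfly../ads</name>
    <value>file:/tmp/ads1,file:/tmp/ads2,file:/tmp/ads3</value>
</property>
```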
All *modifications* performed under the global path `viewfs://global/ads` are
propagated to every destination URI whose underlying filesystem is available.
For instance, if we create a file via the Hadoop shell
```bash
hadoop fs -touchz viewfs://global/ads/z1
```
we will find it via the local filesystem in the latter configuration:
```bash
ls -al /tmp/ads*/z1
-rw-r--r-- 1 user wheel 0 Mar 11 12:17 /tmp/ads1/z1
-rw-r--r-- 1 user wheel 0 Mar 11 12:17 /tmp/ads2/z1
-rw-r--r-- 1 user wheel 0 Mar 11 12:17 /tmp/ads3/z1
```
A read from the global path is processed by the first filesystem that does not
result in an exception. The order in which the filesystems are accessed depends
on whether they are available at the moment and whether a topological order
exists.
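For example, reading the file created above is served by the first available
replica in that order:

```bash
hadoop fs -cat viewfs://global/ads/z1
```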
### Advanced Configuration
Mount points `linkNfly` can be further configured using parameters passed as a
comma-separated list of key=value pairs. The following parameters are currently
supported.
`minReplication=int` determines the minimum number of destinations that have to
process a write modification without exceptions; if fewer succeed, the Nfly
write fails. It is a configuration error to set minReplication higher than the
number of target URIs. The default is 2.
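As a sketch, assuming a single parameter follows the same `key=value` syntax as
the full example further below, a mount point requiring all three targets to
acknowledge every write could look like:

```xml
<property>
    <name>fs.viewfs.mounttable.global.linkNfly.minReplication=3./ads</name>
    <value>uri1,uri2,uri3</value>
</property>
```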
If minReplication is lower than the number of target URIs, some target URIs may
lack the latest writes. This can be compensated for by employing more expensive
read operations controlled by the following settings.
`readMostRecent=boolean` if set to `true` causes the Nfly client to check the
path under all target URIs instead of just the first one based on the topology
order. Among all replicas available at the moment, the one with the most recent
modification time is processed.
`repairOnRead=boolean` if set to `true` causes Nfly to copy the most recent
replica to stale targets so that subsequent reads can again be served cheaply
from the closest replica.
### Network Topology
Nfly seeks to satisfy reads from the "closest" target URI.
To this end, Nfly extends the notion of
<a href="hadoop-project-dist/hadoop-common/RackAwareness.html">Rack Awareness</a>
to the authorities of target URIs.
Nfly applies NetworkTopology to resolve the authorities of the URIs. Most
commonly, a script-based mapping is used in a heterogeneous setup. We could
have a script providing the following topology mapping
| URI | Topology |
|-------------------------------|------------------------- |
| `hdfs://datacenter-east/ads` | /us-east/onpremise-hdfs |
| `s3a://models-us-west/ads` | /us-west/aws |
| `hdfs://datacenter-west/ads` | /us-west/onpremise-hdfs |
If a target URI does not have an authority part, as in `file:/`, Nfly injects
the client's local node name.
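As a hedged sketch, such a mapping could be implemented with a topology script
registered via the standard `net.topology.script.file.name` setting; the
matching rules below are illustrative assumptions, not part of the Nfly code:

```bash
#!/usr/bin/env bash
# Illustrative topology script: Hadoop passes a list of names as arguments
# and expects one rack path printed per argument, in order.
for authority in "$@"; do
  case "$authority" in
    datacenter-east*)  echo "/us-east/onpremise-hdfs" ;;
    models-us-west*)   echo "/us-west/aws" ;;
    datacenter-west*)  echo "/us-west/onpremise-hdfs" ;;
    *)                 echo "/default-rack" ;;
  esac
done
```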
### Example Nfly Configuration
```xml
<property>
<name>fs.viewfs.mounttable.global.linkNfly.minReplication=3,readMostRecent=true,repairOnRead=false./ads</name>
<value>hdfs://datacenter-east/ads,hdfs://datacenter-west/ads,s3a://models-us-west/ads,file:/tmp/ads</value>
</property>
```
### How Nfly File Creation Works
```java
// Note: FileSystem.get takes a URI and a Configuration; create takes a Path.
FileSystem fs = FileSystem.get(URI.create("viewfs://global/"), ...);
FSDataOutputStream out = fs.create(new Path("/ads/f1"));
out.write(...);
out.close();
```
The code above results in the following execution.
1. Create an invisible file `_nfly_tmp_f1` under each target URI, i.e.,
`hdfs://datacenter-east/ads/_nfly_tmp_f1`, `hdfs://datacenter-west/ads/_nfly_tmp_f1`, etc.
This is done by calling `create` on the underlying filesystems, and it returns a
`FSDataOutputStream` object `out` that wraps all four output streams.
2. Each subsequent write on `out` can thus be forwarded to each wrapped stream.
3. On `out.close` all streams are closed, and the files are renamed from `_nfly_tmp_f1` to `f1`.
All files receive the same *modification time* corresponding to the client
system time as of the beginning of this step.
4. If at least `minReplication` destinations have gone through steps 1-3 without
failures, Nfly considers the transaction logically committed; otherwise it tries
to clean up the temporary files in a best-effort attempt.
Note that because step 4 is best-effort and the client JVM could crash and never
resume its work, it is a good idea to provision some sort of cron job to purge
such `_nfly_tmp` files.
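A minimal sketch of such a job, assuming a single HDFS target named as in the
examples above and that no writes are in flight while it runs:

```bash
# Hypothetical crontab entry: every night at 03:00, remove leftover Nfly
# temporary files from one target URI; repeat for each target. Use the full
# path to the hadoop binary if cron's PATH does not include it.
0 3 * * * hadoop fs -rm -f 'hdfs://datacenter-east/ads/_nfly_tmp_*'
```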
### FAQ

1. **As I move from non-federated world to the federated world, I will have to keep track of namenodes for different volumes; how do I do that?**